Description of problem:
While booting ibm-ls21-7972-01.lab.boston.redhat.com with kernel-xen-2.6.18-79.el5, the system reports the following:

PCI-DMA: Out of SW-IOMMU space for 57344 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 44 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 757 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 32768 bytes at device 0000:03:04.0

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-79.el5

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.2-Server-20080212.0 on ibm-ls21-7972-01.lab.boston.redhat.com

Actual results:
printk: 67 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 57344 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 44 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 757 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 32768 bytes at device 0000:03:04.0
printk: 83 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 61440 bytes at device 0000:03:04.0
printk: 65 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 45056 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0
printk: 9 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 12288 bytes at device 0000:03:04.0
printk: 77 messages suppressed.
PCI-DMA: Out of SW-IOMMU space for 16384 bytes at device 0000:03:04.0

Expected results:
We should not have to rate-limit printks on a normal boot. This is a sign of a potentially bigger problem.

Additional info:
Created attachment 295354 [details] Boot log for kernel-xen-2.6.18-79.el5
Created attachment 295355 [details] Boot log for kernel-xen-2.6.18-53.el5
< Notes from Chip Coldwell >

mptsas is causing this. We may be wrong, but we don't know what other I/O device might be doing large chunks of DMA. Actually, not such a big assumption. The boot log has this:

ACPI: PCI Interrupt 0000:03:04.0[A] -> GSI 19 (level, low) -> IRQ 16
mptbase: ioc0: Initiating bringup
ioc0: LSISAS1064 A3: Capabilities={Initiator}
scsi0 : ioc0: LSISAS1064 A3, FwRev=000a0f00h, Ports=1, MaxQ=511, IRQ=16
  Vendor: IBM-ESXS  Model: MAY2036RC  Rev: T106
  Type:   Direct-Access  ANSI SCSI revision: 05

and the error messages are

PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:04.0

I think it's pretty clear that the device at 0000:03:04.0 is mptsas.

--------------------------------------------------------------------------------

OK, mptsas_qcmd calls mptscsih_qcmd, which in turn will call either pci_map_sg or pci_map_single, which are #defines for dma_map_(sg|single). That's the end of the code path that leads to that error message. I cannot see anywhere else where mptsas is calling into the SW-IOMMU.

mptsas_qcmd is installed as the .queuecommand method in the mptsas_driver_template (an instance of struct scsi_host_template). This gets called by scsi_dispatch_cmd, itself called by scsi_request_fn. What this boils down to is that those requests are coming from I/Os submitted to the HBA.

< End of notes from Chip Coldwell >
I'm getting quite a few of these errors during boot with kernel 2.6.18-82.el5-xen:

PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:00:1f.2
ata1.00: exception Emask 0x0 SAct 0x40 SErr 0x0 action 0x0
ata1.00: cmd 61/00:30:d7:b7:7b/04:00:02:00:00/40 tag 6 ncq 524288 out
         res 40/00:1c:d7:a7:7b/00:00:02:00:00/40 Emask 0x40 (internal error)
ata1.00: status: { DRDY }
ata1.00: configured for UDMA/133
ata1: EH complete

dmesg shows this for device 0000:00:1f.2:

libata version 3.00 loaded.
ahci 0000:00:1f.2: version 3.0
GSI 22 sharing vector 0xD0 and IRQ 22
ACPI: PCI Interrupt 0000:00:1f.2[C] -> GSI 20 (level, low) -> IRQ 22
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq pm led clo pio slum part
PCI: Setting latency timer of device 0000:00:1f.2 to 64
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
ata1: SATA max UDMA/133 abar m1024@0xff970000 port 0xff970100 irq 22
ata2: SATA max UDMA/133 abar m1024@0xff970000 port 0xff970180 irq 22
ata3: SATA max UDMA/133 abar m1024@0xff970000 port 0xff970200 irq 22
ata4: SATA max UDMA/133 abar m1024@0xff970000 port 0xff970280 irq 22
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: SAMSUNG HD160JJ/P, ZM100-34, max UDMA7
ata1.00: 312500000 sectors, multi 8: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
Created attachment 295526 [details] dmesg output for the machine in question
Jeff, for the later errors on PCI address 0000:00:1f.2 from comment #4, what's the controller in question? And do we know which was the latest kernel NOT to show these problems?
Stephen,

There are two Jeffs on this BZ. I see that you put in NEEDINFO for me, but can you be a little more specific? Which comment are you asking about?

Thanks,
JeffB
Stephen, this is the chunk that causes the problem. It was added in the -70.el5 kernel by Bill. As you can see, all it does is enforce the DMA restrictions, nothing too serious. The result is that it appears to magnify issues in the SCSI layer.

Before, address_needs_mapping would fail because the whole 64-bit range was masked (which is expected). Now the code is also checking whether each sg entry spans a page boundary, and from the printks we are seeing, many clearly do.

@@ -529,7 +529,9 @@ swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nelems,

 	for (i = 0; i < nelems; i++, sg++) {
 		dev_addr = SG_ENT_PHYS_ADDRESS(sg);
-		if (address_needs_mapping(hwdev, dev_addr)) {
+		if (range_straddles_page_boundary(page_to_pseudophys(sg->page)
+						  + sg->offset, sg->length)
+		    || address_needs_mapping(hwdev, dev_addr)) {
 			buffer.page   = sg->page;
 			buffer.offset = sg->offset;
 			map = map_single(hwdev, buffer, sg->length, dir);
My question about the PCI address specifically referred to comment #4, so I'm asking Jeff M... But the question "do we know which was the latest kernel NOT to show these problems?" is a general request applicable to all of the instances of swiotlb in this BZ, so I'm leaving it open as NEEDINFO(reporter) in general, as I can't set the request to multiple people in BZ.
And to add yet _another_ person to the virtual NEEDINFO list... Don, do we have confirmation that backing out that one section eliminates the messages?
(In reply to comment #6)
> Jeff, for the later errors on PCI address 0000:00:1f.2 from comment #4,
> what's the controller in question?

00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI Controller (rev 01) (prog-if 01 [AHCI 1.0])
	Subsystem: Dell Unknown device 01de
	Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 22
	I/O ports at fe00 [size=8]
	I/O ports at fe10 [size=4]
	I/O ports at fe20 [size=8]
	I/O ports at fe30 [size=4]
	I/O ports at fec0 [size=16]
	Memory at ff970000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable-
	Capabilities: [70] Power Management version 2

> And do we know which was the latest kernel NOT to show these problems?

No, I haven't tried earlier kernels. Would you like me to narrow it down? It will mean rebooting my workstation.
(In reply to comment #10)

No, I haven't confirmed it yet. It seemed obvious, but then again I guess it could be a combination with another patch (even though -70.el5 is mostly Xen patches).
(In reply to comment #8)
> Stephen, this is the chunk that causes the problem. It was added in the
> -70.el5 kernel by Bill. As you can see, all it does is enforce the DMA
> restrictions, nothing too serious. The result is magnifying issues in the
> SCSI layer, it appears.
>
> Before, address_needs_mapping would fail because the whole 64-bit range
> was masked (which is expected). Now the code is checking to make sure the
> sg list is a chain of pages, which in the printks we are seeing are
> clearly not.
>
> @@ -529,7 +529,9 @@ swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nelems,
>
>  	for (i = 0; i < nelems; i++, sg++) {
>  		dev_addr = SG_ENT_PHYS_ADDRESS(sg);
> -		if (address_needs_mapping(hwdev, dev_addr)) {
> +		if (range_straddles_page_boundary(page_to_pseudophys(sg->page)
> +						  + sg->offset, sg->length)
> +		    || address_needs_mapping(hwdev, dev_addr)) {
>  			buffer.page   = sg->page;
>  			buffer.offset = sg->offset;
>  			map = map_single(hwdev, buffer, sg->length, dir);

Don, please include a patch name in the future so I don't have to go digging. Looking at the patch (linux-2.6-xen-handle-multi-page-segments-in-dma_map_sg.patch), the chunk you mention is applied in arch/i386. I am running on an x86_64 box.
JeffM,

There is a *lot* of crossover between these two arches, especially in the Xen case. And if you look in arch/x86_64/kernel/Makefile, you'll see that the i386 version is built in the x86_64 case as well. Another reason the i386/x86_64 upstream merge was a good thing, but we have to live with the split for RHEL-5.

Chris Lalancette
Actually, that isn't quite it. If you look in lib/Makefile you will find that not only do x86_64 and i386 share the same swiotlb.c file, but it _differs_ from bare metal, which explains why you don't see the problem there.
I'm changing this back to ASSIGNED as all of the questions have been answered. Stephen, if you need to know if backing out that patch will fix things, then I'll kick off a build. Don seems convinced that the cause has been identified, though.
*** Bug 436265 has been marked as a duplicate of this bug. ***
*** Bug 436111 has been marked as a duplicate of this bug. ***
Created attachment 298053 [details] Don't perform unnecessary swiotlb copies Possible fix: when we receive page-spanning scatter-gather segments which happen to be machine-contiguous already, don't copy them via swiotlb unnecessarily.
Fuller log for the fix, copied straight from the patch header:

xen dma: avoid unnecessary SWIOTLB bounce buffering.

On Xen kernels, BIOVEC_PHYS_MERGEABLE permits merging of disk IOs that span multiple pages, provided that the pages are both pseudophysically- AND machine-contiguous:

	(((bvec_to_phys((vec1)) + (vec1)->bv_len) == bvec_to_phys((vec2))) && \
	 ((bvec_to_pseudophys((vec1)) + (vec1)->bv_len) == \
	  bvec_to_pseudophys((vec2))))

However, this best-effort merging of adjacent pages can occur in regions of dom0 memory which just happen, by virtue of having been initially set up that way, to be machine-contiguous. Such pages which occur outside of a range created by xen_create_contiguous_region won't be seen as contiguous by range_straddles_page_boundary(), so the pci-dma-xen.c code for dma_map_sg() will send these regions to the swiotlb for bounce buffering.

In RHEL-5.1 this did not happen, because we did not have the check for range_straddles_page_boundary() in that code. Now that that check has been added, these SG ranges, which ARE machine-contiguous and which can perfectly well be sent to a DMA engine, are being bounce-buffered in the swiotlb instead, causing a performance overhead and potentially leading to early swiotlb exhaustion.

This patch adds a new check, check_pages_physically_contiguous(), to the swiotlb_map_sg() code to capture these ranges and map them directly via virt_to_bus() mapping rather than through the swiotlb.
The patched kernel fixes the problem on my system; I no longer see any of the messages pertaining to SW-IOMMU exhaustion.
The 2.6.18-85.el5.swiotlbfix test kernel fixes the issue seen in RHTS as well.
*** Bug 438799 has been marked as a duplicate of this bug. ***
Created attachment 299461 [details] Detect physically-contiguous pages when determining if memory spans a page boundary Updates the previous patch (attachment 298053 [details]). The same test is still performed, but now in the core Xen dma layer, not in the swiotlb code, so the fix still works if we run with swiotlb=off.
Setting flags.
*** Bug 437031 has been marked as a duplicate of this bug. ***
*** Bug 440229 has been marked as a duplicate of this bug. ***
in kernel-2.6.18-89.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 441984 has been marked as a duplicate of this bug. ***
in kernel-2.6.18-90.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
Sorry, disregard previous comment
*** Bug 442347 has been marked as a duplicate of this bug. ***
*** Bug 442094 has been marked as a duplicate of this bug. ***
Adding QLogic and EMC to this bug.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html