Bug 219216
| Field | Value |
|---|---|
| Summary | [EMC/QLogic 5.1 bug] qla2xxx driver running IO on DM-MPIO devices causes "kernel: PCI-DMA: Out of SW-IOMMU space" |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel-xen |
| Version | 5.0 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Pan Haifeng <pan_haifeng> |
| Assignee | Rik van Riel <riel> |
| CC | andrew.vasquez, andriusb, berthiaume_wayne, coldwell, coughlan, ddutile, dzickus, mbarrow, mchristi, qlogic-redhat-ext, rkenna, sameer.shurpalekar, xen-maint |
| Keywords | OtherQA |
| Target Milestone | --- |
| Target Release | --- |
| Fixed In Version | RHBA-2007-0959 |
| Doc Type | Bug Fix |
| Last Closed | 2007-11-07 19:16:42 UTC |
| Bug Blocks | 216989, 217104, 227613, 252029 |
Description
Pan Haifeng, 2006-12-11 22:35:45 UTC
*** Bug 219219 has been marked as a duplicate of this bug. ***

Given that the messages file wasn't attached to the bugzilla, I take it the entries in question are something like:

```
PCI-DMA: Out of SW-IOMMU space for 4608 bytes at device 0000:02:0b.0
PCI-DMA: Out of SW-IOMMU space for 4608 bytes at device 0000:02:0b.0
PCI-DMA: Out of SW-IOMMU space for 4608 bytes at device 0000:02:0b.0
PCI-DMA: Out of SW-IOMMU space for 4608 bytes at device 0000:02:0b.0
```

If so, then as the message implies, the kernel has run out of IOMMU entries for the scatter-gather lists associated with a given command. qla2xxx will detect this via a failure during the mapping call (pci_map_sg() or pci_map_single()) and will fail out accordingly, issuing the proper unmap() call and returning a SCSI_MLQUEUE_HOST_BUSY status from queuecommand(). qla2xxx performs no internal command queuing -- if internal driver resources are available (request-queue entries), the command is immediately submitted to the RISC. So, all IOMMU entries mapped by the driver are in use and will be freed upon command completion.

Are you seeing the same thing with a non-Xen kernel?

You are right; there are more logs like the following:

```
Dec 11 10:06:13 l82bi220 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Dec 11 10:12:38 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.0
Dec 11 10:12:38 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.0
Dec 11 10:12:38 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.1
Dec 11 10:12:38 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.1
Dec 11 10:12:38 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.0
```

I did not see the same on the non-Xen kernel.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release.
Product Management has requested further review of this request by Red Hat Engineering for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.

And I take it the PCI device at 0000:08:03.0/1 is the QLogic HBA? Could you attach the output of 'lspci -vvv'?

Created attachment 143455 [details]
lspci log
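The fail-out path described in the first comment -- mapping failure, the proper unmap() call, and a SCSI_MLQUEUE_HOST_BUSY return from queuecommand() -- can be sketched roughly as below. This is a hypothetical userspace mock, not the real qla2xxx code: queuecommand_result(), map_ok, submit_ok, and the unmapped flag are illustrative stand-ins for the driver's actual calls.

```c
/* SCSI_MLQUEUE_HOST_BUSY tells the SCSI midlayer to retry the command
 * later; the numeric value mirrors the one in Linux <scsi/scsi.h>. */
#define SCSI_MLQUEUE_HOST_BUSY 0x1055

/* Hypothetical mock of the driver's queuecommand() outcome: map_ok and
 * submit_ok stand in for the results of pci_map_sg()/pci_map_single()
 * and of submitting request-queue entries to the RISC. */
static int queuecommand_result(int map_ok, int submit_ok, int *unmapped)
{
    *unmapped = 0;
    if (!map_ok) {
        /* IOMMU entries exhausted: nothing was mapped, just back off. */
        return SCSI_MLQUEUE_HOST_BUSY;
    }
    if (!submit_ok) {
        /* Internal driver resources unavailable: undo the mapping first. */
        *unmapped = 1;
        return SCSI_MLQUEUE_HOST_BUSY;
    }
    /* Command handed to the RISC; mappings are freed on completion. */
    return 0;
}
```

The midlayer treats SCSI_MLQUEUE_HOST_BUSY as "try again later", which is why a full swiotlb shows up as throttling rather than as I/O errors.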
This needs attention. Is this in the host or a guest? How much physical memory, and how much is assigned to the guest (if that's the case)?

It is in a host OS, not a guest OS.

```
[root@l82bi220 ~]# cat /proc/meminfo
MemTotal:      4031596 kB
MemFree:       3239452 kB
Buffers:        250120 kB
Cached:         368280 kB
SwapCached:          0 kB
Active:         409028 kB
Inactive:       265288 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      4031596 kB
LowFree:       3239452 kB
SwapTotal:     2031608 kB
SwapFree:      2031608 kB
Dirty:             224 kB
Writeback:          96 kB
AnonPages:       55780 kB
Mapped:          22628 kB
Slab:            79008 kB
PageTables:       6356 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   4047404 kB
Committed_AS:   140676 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      4852 kB
VmallocChunk: 34359732927 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
```

Looking at the code in detail, I'm no longer sure why exactly this is a blocker. The qla2xxx driver simply throws IO at the swiotlb as fast as it can, and backs off when the swiotlb is full. I can think of two things that would affect system performance negatively:

1) the number of printks from lib/swiotlb.c flying past :) (needs rate limiting?)
2) the fact that we're using the swiotlb at all; surely qla2xxx is capable of addressing memory >4GB?

Re: comment #10:

> 2) the fact that we're using the swiotlb at all, surely qla2xxx is capable
> of addressing memory >4GB ?

Yes, qla2xxx can DMA above 4GB. The driver has employed the following logic for some time:

```c
static void qla2x00_config_dma_addressing(scsi_qla_host_t *ha)
{
	/* Assume a 32bit DMA mask. */
	ha->flags.enable_64bit_addressing = 0;

	if (!dma_set_mask(&ha->pdev->dev, DMA_64BIT_MASK)) {
		/* Any upper-dword bits set? */
		if (MSD(dma_get_required_mask(&ha->pdev->dev)) &&
		    !pci_set_consistent_dma_mask(ha->pdev, DMA_64BIT_MASK)) {
			/* Ok, a 64bit DMA mask is applicable. */
			ha->flags.enable_64bit_addressing = 1;
			ha->isp_ops.calc_req_entries = qla2x00_calc_iocbs_64;
			ha->isp_ops.build_iocbs = qla2x00_build_scsi_iocbs_64;
			return;
		}
	}

	dma_set_mask(&ha->pdev->dev, DMA_32BIT_MASK);
	pci_set_consistent_dma_mask(ha->pdev, DMA_32BIT_MASK);
}
```

According to the system's meminfo output:

```
[root@l82bi220 ~]# cat /proc/meminfo
MemTotal:      4031596 kB
MemFree:       3239452 kB
Buffers:        250120 kB
```

there is less than 4GB on the system. Are you suggesting that dma_get_required_mask() is returning no bits in the upper 32-bit dword on the Xen kernel -- and thus causing the driver to fall back to the 32-bit DMA configuration?

Andrew, could you please attach the /var/log/dmesg from booting up the Xen kernel on this system, so I can search it for any interesting messages? The swiotlb setup and other parts of the kernel should have something to say about what's going on here...

Created attachment 143956 [details]
dmesg log after reboot
Created attachment 143957 [details]
syslog after reboot
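The 64-bit decision in the qla2x00_config_dma_addressing() logic quoted a few comments up hinges on MSD(dma_get_required_mask(...)) -- whether any bits are set in the upper dword of the required mask. A minimal userspace sketch of that check follows; msd() here is an assumed stand-in for the driver's MSD() macro, not the real definition.

```c
#include <stdint.h>

/* Assumed equivalent of the driver's MSD(): the most-significant
 * 32 bits of a 64-bit DMA mask. */
static uint32_t msd(uint64_t mask)
{
    return (uint32_t)(mask >> 32);
}

/* Mirrors the driver's branch: 64-bit addressing is enabled only when
 * the required mask has bits set above the low dword. */
static int enable_64bit_addressing(uint64_t required_mask)
{
    return msd(required_mask) != 0;
}
```

A required mask that fits in 32 bits yields 0 here, so the driver falls back to the 32-bit DMA configuration -- which is exactly the 'hdma-' behavior reported later in this bug.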
The logs contain no kernel initialization messages; instead there is a slew of I/O failures due to some broken SCSI storage:

```
...
sd 5:0:1:31: Device not ready: <6>: Current: sense key: Not Ready
Additional sense: Logical unit not ready, manual intervention required
end_request: I/O error, dev sddr, sector 0
sd 5:0:1:31: Device not ready: <6>: Current: sense key: Not Ready
Additional sense: Logical unit not ready, manual intervention required
end_request: I/O error, dev sddr, sector 0
sd 5:0:1:31: Device not ready: <6>: Current: sense key: Not Ready
```

We'll need the early boot messages ('dmesg -s 100000' might help, if the data was retrieved via the 'dmesg' command).

'dmesg -s 100000' has the same information. Cannot get the required information using dmesg.

Since we are only interested in the kernel boot messages, could you disconnect the broken storage (all storage) and reboot the machine so that the circular buffer does not wrap? 'dmesg -s 100000' should then at least be able to capture the relevant data.

Created attachment 144043 [details]
dmesg output loading Xen kernel
From: sprah_alex
Here is the log that Mike and I retrieved from Haifeng's system.
We have removed the FC cables from the HBA and rebooted the system.
-Alex
In taking a closer look at the driver messages as well, I can verify that, given the logic mentioned in comment #11, only a 32-bit mask is being set:

```
qla2xxx 0000:08:03.0: QLogic Fibre Channel HBA Driver: 8.01.07-k1
  QLogic QLA2462 - PCI-X 2.0 to 4Gb FC, Dual Channel
  ISP2422: PCI-X Mode 1 (133 MHz) @ 0000:08:03.0 hdma-, host#=5, fw=4.00.23 [IP]
```

Basically, 'hdma-' equates to a 32-bit DMA mask being set, and 'hdma+' equates to a 64-bit DMA mask being set. So either:

1) dma_set_mask(&ha->pdev->dev, DMA_64BIT_MASK) is failing,
2) or NO upper-dword bits are set in dma_get_required_mask(),
3) or pci_set_consistent_dma_mask(ha->pdev, DMA_64BIT_MASK) is failing.

QLogic/EMC: we are at the point where this won't make RHEL5 unless a really low-risk patch is proposed... it is highly probable this will be deferred to 5.1.

Rik, given comment #20 and the kernel boot logs in comment #19, is there anything that qla2xxx is doing wrong in setting its DMA mask? I have a feeling dma_get_required_mask() is returning a mask that has no upper-dword bits set. Would a Xen kernel act in such a way even if the machine has less than 4GB of memory (this one, I believe, has 2GB)?

Not that I know. Did you add any printks to the kernel on your test system to figure out exactly what is happening?

I think we are at the point where this needs to be deferred to 5.1...

Hi Andrius. We're able to easily reproduce this. As a result of today's conversation with QLogic and comment #20, QLogic will provide us with an instrumented driver that will, hopefully, answer all the questions and get to the bottom of this one. My fear is that this problem lies in a very popular configuration that we would not be able to support at release time if we can't fix it soon. I am mindful we are running out of runway on this as well. Regards, Wayne.

Andrew @ QLogic: Any ideas in regard to comment #27 from Rik?

Wayne, thanks for the update... Given that more work is needed on this, I think we have run out of time in 5.0 on this one.
Tom, your thoughts?

Officially out of runway for 5.0. Deferred to 5.1.

I've sent EMC a debug driver which displays the 'required-mask' and return codes for the DMA calls. In testing locally with a snapshot6 Xen kernel on an HP x86_64 machine with 2GB, I get the following:

```
QLogic Fibre Channel HBA Driver
PCI: Enabling device 0000:1f:00.0 (0140 -> 0143)
ACPI: PCI Interrupt 0000:1f:00.0[A] -> GSI 16 (level, low) -> IRQ 16
qla2xxx 0000:1f:00.0: Found an ISP2432, irq 16, iobase 0xffffc20000020000
*** qla2x00_config_dma_addressing: required_mask set to 000000007fffffff.
*** qla2x00_config_dma_addressing: required_mask has no high-dword bits set.
*** qla2x00_config_dma_addressing: set consistent 64bit mask returned 0.
*** qla2x00_config_dma_addressing: defaulting to 32bit mask/consistent-mask.
qla2xxx 0000:1f:00.0: Configuring PCI space...
```

This tells me that a 32-bit DMA mask is being set for dma_set_mask() and pci_set_consistent_dma_mask(), since dma_get_required_mask() is returning 7fffffff -- no upper-dword bits set... So I'm still a bit confused by Rik's initial comments on 'double-buffering', given that qla2xxx doesn't set anything less than a 32-bit DMA mask... I've asked EMC to retry their failing machine with snapshot6, as the original results were apparently logged with snapshot2.

Clarification -- not 'double buffering', but: according to the 'required' DMA mask, a 32-bit mask is sufficient... so why should the driver set a 64-bit mask?

Andrew, the value of 0x7fffffff is consistent with the way dma_get_required_mask() works:

```c
	u32 low_totalram = ((max_pfn - 1) << PAGE_SHIFT);
	u32 high_totalram = ((max_pfn - 1) >> (32 - PAGE_SHIFT));
```

2GB of memory is 512k pages, which results in a low_totalram value of 2G and a high_totalram value of 0.
After that we take this branch:

```c
	if (!high_totalram) {
		/* convert to mask just covering totalram */
		low_totalram = (1 << (fls(low_totalram) - 1));
		low_totalram += low_totalram - 1;
		mask = low_totalram;
```

This ends up setting low_totalram (and mask) to one less than 2GB, to be precise 0x7fffffff.

Are you saying that dma_get_required_mask() is doing the wrong thing?

Andrew, since the qlogic driver seems to rely on the swiotlb running out of space to rate-limit itself, would it be enough to simply put the printk from swiotlb_full() under a printk_ratelimit(), or even disable it?

Re: comment #36: That sounds reasonable to me.

Patch proposed upstream: http://lkml.org/lkml/2007/6/1/207 -- depending on feedback, either this patch or a slightly changed one will be submitted for RHEL 5.1.

Rik, why not use printk_ratelimit? devel_ack just the same, since a fix for this BZ is targeted for 5.1. Tom

Fair enough, Tom. I'll submit a patch with just a printk_ratelimit() for inclusion in the RHEL 5.1 kernel.

This request was evaluated by the Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.

Argh. Turns out Xen has this jewel in lib/Makefile:

```
swiotlb-$(CONFIG_XEN) := ../arch/i386/kernel/swiotlb.o
```

That means x86-64 Xen is actually using arch/i386/kernel/swiotlb.o, and we are most likely running into the qla2xxx driver calling swiotlb_map_single() with a to-DMA area that straddles a page boundary. If there is an easy way to disable spanning page boundaries with a non-SG request in the qla2xxx driver, we will not have to bounce-buffer the IO requests at all. Is there a way to achieve this?

A related problem: dma_get_required_mask() is wrong if the Xen kernel is booted on a large system with a dom0 smaller than the maximum machine size. For example, think of a 16GB system which is booted with dom0_mem=2G.
The dom0 kernel will think that it only has 2GB and will set a 32-bit DMA mask, even though the system has way more memory than that. Oops.

Rik - are you waiting on QLogic for comments on this?

According to Rik, the requests coming down from the block layer border a page boundary, and the DMA mapping in turn proceeds down a path which (incorrectly) requires the use of a bounce buffer to manage the exchange. These semantics are all above the low-level driver (qla2xxx); the driver in this case simply registers its supported DMA mask and relies on the upper layers to efficiently manage the DMA pools. There's nothing more qla2xxx can do to address DMA mappings.

Could someone @ QLogic try the following patch out to see if it corrects the problem? (Sorry, I don't have hardware to test with.) http://lists.xensource.com/archives/html/xen-changelog/2007-07/msg00093.html

(In reply to comment #9)
> It is in a host OS not a guest OS.
> [root@l82bi220 ~]# cat /proc/meminfo
> MemTotal: 4031596 kB
> LowTotal: 4031596 kB

Does it strike anybody else as odd that we are using bounce buffers when all of memory is low memory? Chip

(In reply to comment #43)
> Argh. Turns out Xen has this jewel in lib/Makefile:
>
> swiotlb-$(CONFIG_XEN) := ../arch/i386/kernel/swiotlb.o
>
> That means x86-64 Xen is actually using arch/i386/kernel/swiotlb.o, and we
> most likely are running into the qla2xxx driver calling swiotlb_map_single()
> with a to-DMA area that straddles a page boundary.

It can't be swiotlb_map_single, because that function will panic after emitting the message (it calls swiotlb_full(hwdev, size, dir, 1), and that last argument set to 1 means it will panic). It must be swiotlb_map_sg that gets called.

> If there is an easy way to disable spanning page boundaries with a non-SG
> request in the qla2xxx driver, we will not have to bounce buffer the IO
> requests at all. Is there a way to achieve this?

I think this is wrong.
It is definitely an SG request that is generating the messages. Chip

Note to EMC (Wayne/Haifeng): can you please test the patch in comment #51 ASAP? Does this solve the issue?

I did the following to apply the patch, compile the kernel, and make a new init image. From my observation, the patch did not fix the issue.

```
# wget kernel-2.6.18-8.el5.src.rpm
# rpm -ivh kernel-2.6.18-8.el5.src.rpm
# cd /usr/src/redhat/SPECS
# rpmbuild -bp kernel-2.6.spec   <--- this should apply the Xen patches
# cd ../BUILD/kernel-2.6.18/linux-2.6.18
# patch -p1 < /extra/xenPatch.diff
# make mrproper
# cp configs/kernel-2.6.18-i686-xen.config .config
# make
# make modules_install && make install
# mv /boot/initrd-2.6.18-8.el5xen.img /boot/initrd-2.6.18-8.el5xen.img.bak
# mkinitrd -v /boot/initrd-2.6.18-8.el5xen.img 2.6.18-8.el5xen
# vi /boot/grub/grub.conf
# shutdown -r now
```

After the server booted up and IO was run on the patched kernel, it still hit the error:

```
Aug 9 14:45:45 l82bi220 last message repeated 3 times
Aug 9 14:45:45 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.1
Aug 9 14:45:45 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.0
Aug 9 14:45:45 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.0
Aug 9 14:45:45 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.1
Aug 9 14:45:45 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.1
Aug 9 14:45:45 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:08:03.0
Aug 9 14:45:45 l82bi220 last message repeated 2 times
Aug 9 14:45:58 l82bi220 kernel: PCI-DMA: Out of SW-IOMMU space for 16384 bytes at device 0000:08:03.0
```

OK, so the qla2xxx driver stuffs the SW-IOMMU full of sg requests with elements larger than page size :( I am not sure what we can do here, except maybe quiet down the printk...

Marcus/Andrew - thoughts here?
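Rik's derivation of the 0x7fffffff value can be checked numerically. The sketch below is a userspace reimplementation, under the assumption that PAGE_SHIFT is 12 and that fls() is the kernel's 1-based find-last-set, of the !high_totalram branch of dma_get_required_mask() quoted above:

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* 1-based position of the highest set bit (0 for input 0) -- a
 * userspace stand-in for the kernel's fls(). */
static int fls32(uint32_t x)
{
    int r = 0;
    while (x) {
        r++;
        x >>= 1;
    }
    return r;
}

/* Sketch of the !high_totalram branch: convert the page count into a
 * mask just covering totalram. */
static uint64_t required_mask_low(uint64_t max_pfn)
{
    uint32_t low_totalram = (uint32_t)((max_pfn - 1) << PAGE_SHIFT);

    low_totalram = (1u << (fls32(low_totalram) - 1));
    low_totalram += low_totalram - 1;
    return low_totalram;
}
```

For 2GB (512k pages, so max_pfn = 0x80000) this yields 0x7fffffff -- exactly the value the instrumented driver printed, and a mask with no upper-dword bits, so qla2xxx falls back to 32-bit DMA.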
I am going to meet with Haifeng and learn how he reproduces this, and try to do that at Red Hat. It seems to me our handling in this area is standard. If we can reproduce this at Red Hat, it will be easier for everyone to look at.

Just for clarification here (not necessarily trying to beat this to death), but a SCSI LLD (low-level driver) is simply a transparent consumer of SG entries prepared and mapped by the upper layers. qla2xxx doesn't manipulate sizes or counts of SG entries. Again, I'm not entirely clear that an LLD can 'do' something about this; if a request's SG list can't be mapped by the upper layers, the I/O is simply flagged for retry.

Created attachment 161206 [details]
quiet down the kernel
The qla2xxx driver seems to intentionally fill up the swiotlb (with requests
that don't fit in a page, so they need to be bounce buffered under Xen).
Unless the system has another driver that panics when the swiotlb is full,
there should be no bad side effects.
This trivial patch quiets down the kernel. If something panics, we will still
have enough error messages to figure out what went wrong. This patch should
not introduce any regressions and has been proposed for inclusion in RHEL 5.1.
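The patch itself is attached, but the idea behind printk_ratelimit() can be illustrated with a small userspace sketch. This is not the kernel implementation (which tracks jiffies and a global burst budget); struct ratelimit and its fields here are illustrative stand-ins.

```c
#include <time.h>

/* Toy model of printk_ratelimit(): allow at most 'burst' messages per
 * 'interval' seconds and silently drop the rest. */
struct ratelimit {
    time_t window_start;
    int interval; /* seconds per window */
    int burst;    /* messages allowed per window */
    int printed;  /* messages emitted in the current window */
};

static int ratelimit_ok(struct ratelimit *rl, time_t now)
{
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now; /* new window, reset the budget */
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return 0; /* suppress: the budget for this window is spent */
    rl->printed++;
    return 1; /* ok to print */
}
```

Guarding the swiotlb_full() printk with such a check keeps a few diagnostic lines per window while preventing the message flood seen in this bug.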
This bug is for creating a workaround for the printk floods in RHEL 5.1; bug 252029 has been created for a longer-term solution slated for RHEL 5.2.

In 2.6.18-40.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Downloaded and installed kernel-2.6.18-40.el5xen. Ran IO on the new test kernel for 4 hours; no error messages came out.

```
[root@l82bi220 current]# uname -a
Linux l82bi220.lss.emc.com 2.6.18-40.el5xen #1 SMP Tue Aug 14 18:12:49 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html