Description of problem: If we have more than 4 LSI RAID adapters in a system running RHEL 3 U3 or U4 em64t, the driver fails to allocate either sglist or passthru memory after the 4th adapter (pci_alloc_consistent fails). The em64t failure seems to be related to the memory configuration: if we have more than 4GB of memory, the driver fails to allocate memory after the 4th adapter, but if we lower the memory to < 4GB it works. Additional info: Decreasing MAX_COMMANDS in megaraid2.h from 126 to, e.g., 64 seems to work around the issue regardless of memory size. We observed other problems as well. With more than 4 LSI RAID adapters, we have seen kernel panics caused by other drivers that load after the megaraid2 driver. This happens only with the em64t kernel and when we have >= 4GB of memory. I am not sure yet whether this is a megaraid2 issue or a kernel issue that happens to surface in these setups.
Created attachment 110581 [details] RHEL 3 U4 em64t with 4GB Tested latest driver from LSI and it fails the same way as the inbox driver
In RHEL 3 U3, pci_alloc_consistent() on Intel em64t systems allocates from ZONE_DMA (which is just 16MB) irrespective of the dma mask. The workaround is to use the 'swiotlb=' boot option. If I set, for example, 'swiotlb=32768' or 'swiotlb=65536', then the inbox driver loads and finds all the adapters (7 adapters in our case). When the swiotlb window size is 2MB, the swiotlb buffers (which need to be contiguous and are allocated at boot time) are also allocated from the first 16MB (i.e., they fall into ZONE_DMA), leaving less memory for consistent mappings.
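For reference, 'swiotlb=' is just an extra argument on the kernel line in grub.conf, along the lines of the following (paths illustrative; the value is the number of 2KB slabs):

kernel /vmlinuz ro root=LABEL=/ swiotlb=65536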
The same problem causes a panic during the install. The RHEL 3 U4 em64t installer panics if there is >= 4GB of memory and > 4 RAID adapters. Attaching the boot and panic messages.
Created attachment 111393 [details] the panic info RHEL 3 U4 em64t installer boot message and the panic info with 4GB and >4 RAID adapters
Can you work around the installer problem by passing the 'swiotlb=' boot option?
Unfortunately, no.
If we use noprobe and load the megaraid2 driver manually, the kernel does not panic but we don't see all the adapters. Using 'swiotlb=' in addition to noprobe does not change the behavior.
Created attachment 111839 [details] Patch fixing the issue Attached patch fixes this issue. This patch is a backport of the 2.6 patch posted here: http://www.gelato.unsw.edu.au/linux-ia64/0410/11406.html Basically, for pci_alloc_consistent we remove the dependency on GFP_DMA by allocating with GFP_NORMAL, and fall back to the swiotlb if we end up with a memory address > 4GB.
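The allocation path after this change ends up looking roughly like the sketch below (illustrative names only, not the literal patch; the real patch falls through to the swiotlb instead of just returning NULL):

/*
 * Illustrative sketch of the approach, not the literal patch.
 * Allocate from the normal zone first; only if the page lands above
 * 4GB do we give it back and fall through to the swiotlb bounce pool.
 */
#include <linux/pci.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <asm/io.h>

static void *sketch_alloc_consistent(struct pci_dev *hwdev, size_t size,
                                     dma_addr_t *dma_handle)
{
        int order = get_order(size);
        void *vaddr = (void *)__get_free_pages(GFP_KERNEL, order);

        /* Good case: allocation succeeded and sits entirely below 4GB. */
        if (vaddr && virt_to_phys(vaddr) + size - 1 <= 0xffffffffUL) {
                *dma_handle = virt_to_phys(vaddr);
                memset(vaddr, 0, size);
                return vaddr;
        }

        /* Above 4GB (or out of memory): release it and let the swiotlb
         * provide a bounce buffer below 4GB instead of failing outright. */
        if (vaddr)
                free_pages((unsigned long)vaddr, order);
        return NULL;            /* the real patch falls back to the swiotlb here */
}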
Thanks for the patch. It looks good to me. Unfortunately, the deadline for RHEL 3 U5 has passed. This fix will probably have to wait for U6.
Suresh, isn't the swiotlb still located in ZONE_DMA though? So its size could consume all of ZONE_DMA. A secondary problem is that the scsi_malloc() pool can consume all of ZONE_DMA too, which is alleviated by decreasing MAX_COMMANDS in each driver; IT67111 describes my approach to solving this.
Matt, the default swiotlb size in U5 has been increased to 64MB, and because of this it now gets allocated from above ZONE_DMA. About the scsi_malloc() pool issue, I agree that it needs to be changed. I don't have access to IT67111. Can you send me the gist of it? Is it addressing the problem tracked by this bug, or just the scsi_malloc() ZONE_DMA pool issue?
http://marc.theaimsgroup.com/?l=linux-scsi&m=111117365306386&w=2 is the scsi_malloc() pool patch I mentioned. It adds two parameters to scsi_mod: max_dma_memory=N, where N is in megabytes and limits how much memory the scsi_malloc() pool can consume (by default 32MB, though that can only be reached on IA64, where ZONE_DMA is defined to be all of the first 4GB of address space; otherwise it runs out well before 16MB); and use_zone_normal=1, to force the pool to come from ZONE_NORMAL rather than ZONE_DMA. This isn't always safe, but when it is safe it's a really good thing, since otherwise the pool can consume all of ZONE_DMA pretty easily.
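For anyone without list access, this is roughly how such options are wired up in a 2.4 module (illustrative sketch only, using the option names described above; not the posted patch):

/*
 * Illustrative only - not the posted patch.  Shows the usual 2.4-style
 * module parameter plumbing for the two options described above and how
 * the pool allocation could key off them.
 */
#include <linux/module.h>
#include <linux/mm.h>

static int max_dma_memory = 32;   /* MB; cap on the scsi_malloc() pool */
static int use_zone_normal = 0;   /* 1 = take pool pages from ZONE_NORMAL */

MODULE_PARM(max_dma_memory, "i");
MODULE_PARM(use_zone_normal, "i");

static unsigned long sketch_pool_page(void)
{
        /* Only avoid ZONE_DMA when the administrator says it is safe. */
        unsigned int gfp_flags = use_zone_normal ? GFP_KERNEL
                                                 : GFP_KERNEL | GFP_DMA;

        return __get_free_pages(gfp_flags, 0);
}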
-----Forwarded Message-----
From: Susan S. Denham <sdenham>
To: Matt Domsch <Matt_Domsch>, John Hull <John_Hull>, Dale Kaisner <dale_kaisner>, Amit Bhutani <amit_bhutani>
Cc: Rob Landry <rlandry>, Larry Woodman <lwoodman>, peterm, jburke, Jay Turner <jkt>
Subject: Please test: U5 test kernel (post U5 beta) that addresses IT 67111 and 50598
Date: 31 Mar 2005 16:04:16 -0500

Guys, my hero Larry Woodman has done up a post-beta U5 test kernel (these fixes are being considered for U5 GA) that he thinks kills two birds with one stone:

- IT 67111 (Dell); IT 68265 (Intel); BZ 146954 - megaraid2 driver fails to recognize all LSI RAID adapters when there are more than 4 with >= 4GB
- IT 50598 - ata_piix doesn't find disks with >= 4GB RAM (Dell MUSTFIX)

Please grab the kernel from the location below and let us know your test results ASAP.

http://people.redhat.com/coughlan/RHEL3-swiotlb/kernel-2.4.21-31.swiotlb.EL.ia32e.rpm
http://people.redhat.com/coughlan/RHEL3-swiotlb/kernel-2.4.21-31.swiotlb.EL.x86_64.rpm
http://people.redhat.com/coughlan/RHEL3-swiotlb/kernel-smp-2.4.21-31.swiotlb.EL.x86_64.rpm

Thanks, Sue

Also sent Larry's test kernel to Levent Akyil of Intel on 3/31, who reports that it "looks good. On my setup, all modules loaded and worked as expected (I was able to see all the RAID adapters and no USB or ata-piix oops). I didn't test all the failing configurations though since some of them have to be reproduced in the validation labs but I am pretty sure this would work for those setups as well. We can test more thoroughly with the next U5 beta/RC drop."
Can someone from Intel please explain whether or not 64-bit capable cards need to use buffers that are below the 4GB boundary? If not, why are we checking against the 4GB boundary rather than the device dma_mask? Thanks, Larry Woodman
That's because of the pci_alloc_consistent behavior. Documentation/DMA-mapping.txt says the consistent DMA mapping interface will always return a SAC addressable DMA address. Though I don't know the reason why this behavior is expected!
The description in DMA-mapping.txt seems to say that we should be trusting the dma_mask, since it was established via a call to pci_set_dma_mask(). If that does the right thing, then shouldn't we be checking the buffer against the dma_mask rather than 0xffffffff (4GB)? Larry
<snip> Consistent DMA mappings are always SAC addressable. That is to say, consistent DMA addresses given to the driver will always be in the low 32-bits of the PCI bus space. </snip>
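So the question is whether to check against the per-device mask or against that fixed SAC limit. Roughly (illustrative helpers only, not code from either patch):

/* Illustrative helpers to show the two checks being discussed. */
#include <linux/pci.h>

/* What Larry is suggesting: trust the mask set via pci_set_dma_mask(). */
static int fits_device_mask(struct pci_dev *pdev, dma_addr_t bus_addr, size_t size)
{
        return (bus_addr + size - 1) <= pdev->dma_mask;
}

/* What the patch does, per the DMA-mapping.txt rule quoted above:
 * consistent mappings must be SAC, i.e. below the fixed 4GB line. */
static int fits_sac_limit(dma_addr_t bus_addr, size_t size)
{
        return (bus_addr + size - 1) <= 0xffffffffUL;
}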
Created attachment 112990 [details] Rewrite the patch so we don't fail if there is a swiotlb. I rewrote the patch so that it calls swiotlb_map_single() if the memory allocation fails. There is no sense in failing before we try to allocate from the swiotlb if one exists, is there? Larry
Larry, Changes look good to me. Thanks.
PM ACK for U6
Larry Troan, regarding comment #34, this is a RHEL3 bug. Why is building the patch into a RHEL4 kernel relevant?
I don't see comment #34. AFAIK, this issue exists only with RHEL3.
From User-Agent: XML-RPC Per Matt.... Finger check.... kernel-2.4.21-32.EL.smp RHEL3.... This event sent from IssueTracker by ltroan issue 73055
This bug is fixed in RHEL 3 U6 early release.
Please provide the kernel version where this fix will be available for Dell regression. Thanks
RHEL3 U6 does *not* contain a fix for this problem.
U6 is closed (and in beta already).
To bug reporter levent.akyil, does this bugzilla need to remain confidential to Intel? If not, could you please uncheck the "Intel Confidential Group" box below? Thanks in advance.
From User-Agent: XML-RPC Matt, can you elaborate on Dell's (your) results in testing the Red Hat/Intel patch in bug 146954, which Intel has made public? mdomsch assigned to issue for Dell-Engineering. Internal Status set to 'Waiting on Customer' Status set to: Waiting on Client This event sent from IssueTracker by ltroan issue 75976
Opening up comment #50, which requests Dell to elaborate on the problems they had in testing the Intel patch.
This bug apparently is not a DUP of bug 146789 (Engineering's call) but is tied to it. It is believed that there is a common patch which will resolve both the problems described here and those described in bug 146789. Both bugs are now public.
I am still waiting to hear back from Dell and Intel as to whether or not the latest patch I posted works for everyone. Since there was some disagreement between Dell and Intel (Dell said the patch did not work but Intel said it did work), I had to pull it from the RHEL3 U6 kernel. I would like to get this issue resolved, but I need to know for sure where it does and does not work so I can fix it if necessary. The sooner I get this feedback, the sooner I can fix it if necessary and get the patch into the RHEL3 U7 kernel. Larry Woodman
We are in the process of testing this patch. We'll provide test results ASAP.
Created attachment 118665 [details] Can Intel please grab and test this patch ASAP? I need Intel to grab this patch for inspection and testing ASAP. Thanks, Larry Woodman
Larry, an issue with the patch in comment #55: we shouldn't use PCI_DMA_BIDIRECTIONAL. We should use PCI_DMA_FROMDEVICE when calling swiotlb_map_single() and PCI_DMA_TODEVICE when calling swiotlb_unmap_single(). This avoids the memcpy's inside the swiotlb map/unmap_single routines.
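Something like the sketch below is what we have in mind (illustrative wrappers only, and the swiotlb prototypes shown are assumed 2.4-era declarations; adjust to whatever the tree actually exports):

/*
 * Sketch only.  The direction arguments keep the bounce-buffer copies out
 * of the consistent-allocation path: FROMDEVICE on map means "don't copy
 * the caller's (uninitialized) data in", TODEVICE on unmap means "don't
 * copy the bounce buffer back out".
 */
#include <linux/pci.h>

/* Prototypes assumed for this sketch. */
extern dma_addr_t swiotlb_map_single(struct pci_dev *hwdev, void *ptr,
                                     size_t size, int direction);
extern void swiotlb_unmap_single(struct pci_dev *hwdev, dma_addr_t dev_addr,
                                 size_t size, int direction);

static dma_addr_t sketch_bounce_map(struct pci_dev *hwdev, void *vaddr, size_t size)
{
        /* FROMDEVICE on map: skip copying the uninitialized buffer in. */
        return swiotlb_map_single(hwdev, vaddr, size, PCI_DMA_FROMDEVICE);
}

static void sketch_bounce_unmap(struct pci_dev *hwdev, dma_addr_t bus_addr, size_t size)
{
        /* TODEVICE on unmap: skip copying the bounce buffer back out. */
        swiotlb_unmap_single(hwdev, bus_addr, size, PCI_DMA_TODEVICE);
}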
Created attachment 118668 [details] Map single panic
We tested the patch posted on 4/11 and the one posted today. We encountered a kernel panic during boot when the system has >= 4GB of memory. When we lowered the memory to 2GB, the system booted fine. See the attached kernel panic message above.
OK, can you experiment with "swiotlb=<size>" on the boot line, where size is the actual SWIOTLB size divided by 2KB? By default it is set to 32768, which gives us a 64MB SWIOTLB. Please try doubling that (65536); I suspect some driver asks for more as the system memory increases. Thanks, Larry
Yes, we experimented with "swiotlb=<size>" and did other testing as well:
1. Increased swiotlb to 65536 and 131072 in 2.4.21-37 with the 4/11 patch ----- still encountered the kernel panic with the MPT driver as described above in comment #57;
2. Used the MPT 2.05.16.02 driver found in the 2.4.21-32 kernel in 2.4.21-37 with the 4/11 patch ----- kernel booted up fine;
3. Increased IO_TLB_SEGSIZE to 256 in 2.4.21-37 with the 4/11 patch ----- kernel booted up fine, but rmmoding and re-insmoding the megaraid2 driver resulted in a kernel panic with a null pointer dereference in __list_del() in scsi_softirq_handler(). Don't know if this is related to the swiotlb changes though.
Thanks Dely for the update. Larry, the newer version of the Fusion MPT base driver (mptbase: PrimeIocFifos()) requests a bigger chunk (376832 bytes) of pci_alloc_consistent mapping. Currently the limit on the maximum allowable contiguous chunk with swiotlb is 128 (IO_TLB_SEGSIZE) * 2KB. So increasing IO_TLB_SEGSIZE to 256 makes the panic in comment #57 go away. As Dely mentioned, we seem to be having another issue with the megaraid2 driver. We will check whether it has anything to do with swiotlb and get back to you Monday morning.
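To spell out the arithmetic (assuming the usual 2KB swiotlb slab size):

IO_TLB_SEGSIZE = 128:  128 slabs * 2048 bytes = 262144 bytes of contiguous mapping, less than the 376832 bytes PrimeIocFifos() asks for, hence the failure.
IO_TLB_SEGSIZE = 256:  256 slabs * 2048 bytes = 524288 bytes, which covers the request.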
For sighting #3 in comment #60, we identified that it was a megaraid2 driver issue. We put a patch (submitted for bugzilla 154028) into the driver, and rmmoding and insmoding the megaraid2 driver then worked fine without a panic. Therefore, Larry's 4/11 patch + increasing IO_TLB_SEGSIZE to 256 (as suggested in comment #57) fixes the issue.
Correction: increasing IO_TLB_SEGSIZE to 256 is suggested in comment #61.
Created attachment 118765 [details] SWIOTLB patch fixing the issue Larry, this is the patch which we have tested and which works. It is essentially the same as your patch in comment #18, with the additional change of IO_TLB_SEGSIZE increased to 256. We are doing extensive validation of this patch. Dely will post those results as soon as they are available.
Created attachment 118873 [details] Patch fixing the issue Dale Busacker from Intel did more validation of the patch in comment #64 and found that the Adaptec RAID ASR2230 requests 634880 bytes of contiguous memory and needs IO_TLB_SEGSIZE increased to 512 to be functional. We request Red Hat to pick up the patch attached to this comment, which increases IO_TLB_SEGSIZE to 512. We tested this patch successfully with the SCSI controllers (and driver versions) listed below:
qla2300 7.05.00-RH1
lpfc 7.3.2
mptscsih 2.06.16.01
aacraid 1.1-5[2361]
megaraid2 2.10.10.1 (along with the driver fix posted in bugzilla #154028)
For the past few RHEL3 updates we have deferred fixing the DMA allocation failures on EM64T. The problem is that the ia32e systems do not have hardware IOMMUs, so we must allocate DMA zone memory for DMA buffers whenever there is more than 4GB of RAM. The reason for this is that there are 2 zones: the DMA zone for physical addresses between 0 and 16MB, and the Normal zone for physical addresses between 16MB and the end of RAM. If there is more than 4GB of RAM in the system we must allocate DMA buffers from the 16MB DMA zone, because we can't be sure that a Normal zone page is below the 4GB boundary. Because the DMA zone is so small (only 16MB, or 4096 pages), pci_alloc_consistent() frequently fails, which results in driver loading failures, etc. In order to solve this problem without backporting the RHEL4 changes, I have added a boot-time option to increase the size of the DMA zone. This allows one to increase the size of the DMA zone and therefore significantly reduce the risk of failing to allocate DMA buffers. This solution does, however, eliminate the possibility of using 24-bit ISA devices when the DMA zone has been increased above 16MB. This is supposedly not a problem, because no one uses those devices in EM64T systems anyway. What does Intel think about this?

--- linux-2.4.21/arch/x86_64/kernel/e820.c.orig
+++ linux-2.4.21/arch/x86_64/kernel/e820.c
@@ -139,6 +139,11 @@ unsigned long end_pfn_map;
 unsigned long end_user_pfn = MAXMEM>>PAGE_SHIFT;

 /*
+ * last DMA zone pfn
+ */
+unsigned long end_dma_pfn = 0;
+
+/*
  * Find the highest page frame number we have available
  */
@@ -570,6 +575,11 @@ void __init parse_cmdline_early (char **
 			from+=8;
 			setup_io_tlb_npages(from);
 		}
+
+		else if (!memcmp(from, "maxdma=", 7)) {
+			end_dma_pfn = memparse(from+7, &from);
+			end_dma_pfn >>= PAGE_SHIFT;
+		}
 #endif
 #ifdef CONFIG_ACPI_PMTMR
 		else if (!memcmp(from, "pmtmr", 5)) {
--- linux-2.4.21/arch/x86_64/mm/numa.c.orig
+++ linux-2.4.21/arch/x86_64/mm/numa.c
@@ -79,6 +79,8 @@ void __init setup_node_bootmem(int nodei

 EXPORT_SYMBOL(maxnode);

+extern unsigned long end_dma_pfn;
+
 /* Initialize final allocator for a zone */
 void __init setup_node_zones(int nodeid)
 {
@@ -93,9 +95,14 @@ void __init setup_node_zones(int nodeid)
 	end_pfn = PLAT_NODE_DATA(nodeid)->end_pfn;
 	printk("setting up node %d %lx-%lx\n",
 		nodeid, start_pfn, end_pfn);
-
-	/* All nodes > 0 have a zero length zone DMA */
-	dma_end_pfn = __pa(MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+
+	/* bootline maxdma= option overrides MAX_DMA_ADDRESS */
+	if (end_dma_pfn)
+		dma_end_pfn = end_dma_pfn;
+	else
+		dma_end_pfn = __pa(MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+
+	/* All nodes > 0 have a zero length zone DMA */
 	if (start_pfn < dma_end_pfn) {
 		zones[ZONE_DMA] = dma_end_pfn - start_pfn;
 		zones[ZONE_NORMAL] = end_pfn - dma_end_pfn;
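For testing, maxdma= would go on the kernel boot line like any other option, e.g. (illustrative grub entry):

kernel /vmlinuz ro root=LABEL=/ maxdma=64M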
Created attachment 123878 [details] Fix to Larry's proposal in comment #73 Larry, your proposed patch in comment #73 doesn't work as it is. The attached patch has fixes to your proposal, and this patch worked on a couple of our test setups. We are currently doing more tests with it. Larry, what do you think about your proposal now? Modules built with this proposed patch may not work on earlier update kernels. Is that OK?
I tested the patch that Suresh submitted above and it worked fine on a system with 6GB memory, 6 MegaRAID cards, and 1 Adaptec ASR2230S card. I tried using maxdma=24M, 32M, 48M, and 64M, and I didn't see any error messages from the megaraid2 and aacraid modules. The NIC devices came up fine and the USB keyfob worked fine.
Suresh, MAX_DMA_ADDRESS is the only change. Why is this necessary? It breaks the KABI. Larry
Larry, please look at the usage of MAX_DMA_ADDRESS in specifying the goal for the alloc_bootmem routines (include/linux/bootmem.h). These routines exhaust the memory above 16MB (which falls into our extended DMA zone) for allocating bootmem.
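For reference, the stock definitions look roughly like this (quoted from memory, so treat as approximate); the third argument is the allocation "goal", which is why boot-time allocations start eating memory right above 16MB:

/* From include/linux/bootmem.h (approximate) - the goal argument steers
 * boot-time allocations to addresses at or above MAX_DMA_ADDRESS (16MB),
 * i.e. exactly the range an enlarged DMA zone is meant to cover. */
#define alloc_bootmem(x) \
        __alloc_bootmem((x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_pages(x) \
        __alloc_bootmem((x), PAGE_SIZE, __pa(MAX_DMA_ADDRESS))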
I realize this; however, changing MAX_DMA_ADDRESS breaks the kernel ABI. Larry
How about modifying the bootmem routines (include/linux/bootmem.h) to use end_dma_pfn?
Changing #defines in bootmem.h probably breaks the kernel ABI as well because drivers that use it probably #include that file. Larry
Created attachment 127562 [details] Patch as per conversation with Larry. Larry, as per our discussion, here is the patch (modifying include/linux/bootmem.h). Dely, please post your test results with this patch as soon as possible. Thanks.
Created attachment 127564 [details] My cut at the same patch... Hi Suresh, here is my cut of the patch. It looks almost identical :) Larry BTW, have you had a chance to do any testing?
BTW, the binary rpm with my patch is located here: >>>http://people.redhat.com/~lwoodman/.for_intel/ Larry
This issue is on Red Hat Engineering's list of planned work items for the upcoming Red Hat Enterprise Linux 3.8 release. Engineering resources have been assigned and barring unforeseen circumstances, Red Hat intends to include this item in the 3.8 release.
I'll test the patch today.
I've updated the patch based on internal Red Hat feedback and placed a new kernel binary rpm for testing in: >>>http://people.redhat.com/~lwoodman/.for_intel/
Created attachment 127624 [details] maxdma patch used in the above kernel and to be reviewed by Intel Latest maxdma= boot option patch based on internal Red Hat review.
I tested the rpm posted by Larry in comment #93 on a system with 6GB memory, 6 MegaRAID cards, and 1 Adaptec ASR2230S card. I tried using maxdma=24M and 32M. I didn't see any error messages from the megaraid2 and aacraid modules. The NIC devices came up fine and the USB keyfob worked fine. Without using maxdma on 2.4.21-40.6.ia32e.EL, or when using the RHEL 4 U3 kernel, I saw an error message from megaraid2 like "RAID: Can't allocate passthru" and the system hung while initializing the USB controller.
Created attachment 127635 [details] dmesg from run with maxdma=32M This is the dmesg from test run using Larry's rpm with maxdma=32M.
Created attachment 127636 [details] dmesg from run w/o using maxdma boot option This is the dmesg from test run using Larry's rpm but w/o using maxdma boot parameter.
Please grab the latest patch/latest rpm and re-run the test. Thanks, Larry
The test run was done with the rpm from comment #93. Where is the latest patch or rpm mentioned in comment #98?
Sorry, the latest patch and rpm are the ones you tested from comments #93 and #94. I was just afraid that you didn't have the latest one. Larry Woodman
I didn't receive a notification when you posted comment #97, so I didn't respond sooner. Yes, the test was done with the rpm from comment #93. Does the dmesg from using maxdma=32M look OK to you? I did see the difference when testing the latest rpm with and without the maxdma boot option.
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.7.EL).
A kernel has been released that contains a patch for this problem. Please verify if your problem is fixed with the latest available kernel from the RHEL3 public beta channel at rhn.redhat.com.
Reverting to ON_QA.
I tested the RHEL 3 U8 private beta on a Harwich system with 6GB memory, 6 MegaRAID cards, and 1 Adaptec ASR2230S card. Using the maxdma=32M boot parameter both during installation and on subsequent boots, I was able to install and boot RHEL 3 U8 private beta. Without maxdma=32M, a kernel panic was encountered during installation and subsequent boots.
This bugzilla can be closed. On a Harwich system with 6GB memory, 6 MegaRAID cards, and 1 Adaptec ASR2230S card, I was able to install and boot RHEL 3 U8 Beta 2 using the maxdma=32M boot parameter both at installation and on subsequent boots.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0437.html