Description of problem:

This is a followup bug to Red Hat bug 197979. The InfiniPath PCIe and HyperTransport chips are capable of 64-bit DMA. On x86_64 systems with 4GB or more memory, the DMA does not show up at the correct location in memory (the virtual address returned by dma_alloc_coherent()). We are using the dma_addr_t returned by dma_alloc_coherent() to program our chip.

If, on systems that support it, memory remapping around the I/O address hole is disabled, everything works fine on both 4GB and 8GB systems. Unfortunately, this workaround is not acceptable to customers who have I/O devices with large address ranges, since it can cause the loss of up to 1 GByte of memory.

If I change the driver to call pci_set_dma_mask(pdev, DMA_32BIT_MASK) and pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK), then DMA goes to the correct location. This forces the IOMMU to be used (a minimal sketch of this mask setup appears after the osu_bw numbers below). This workaround causes very significant performance problems with InfiniPath MPI, both across the InfiniBand fabric and with shared memory, although the latter is intermittent and may depend on where in physical memory the buffers are located.

Version-Release number of selected component (if applicable):

Occurs on RHEL4 UP3 (2.6.9-34), and on RHEL4 UP4 beta (2.6.9-39, -40, and -41).

The performance problem shows up as an increase in latency from the normal 1.4 usec (on a 2.0 GHz Opteron) to 10+ usec. Bandwidth for medium-size payloads is also sharply reduced, by 400 MByte/sec or more. Obviously this isn't a good thing.

The problem occurs with the InfiniPath driver from our 1.3 release, downloaded from the PathScale website, and also with the OFED 1.0 InfiniPath driver that is in RHEL4 UP4 Beta2 2.6.9-40 and later builds.

InfiniPath MPI is needed to reproduce this, not the MVAPICH-over-OpenIB version of MPI. The benchmarks are the osu_latency and osu_bw programs, available from Ohio State University and included as part of the InfiniPath 1.3 software download (in source and binary form).

The typical symptom of the problem without the workarounds (and with the driver in the 2.6.9-41 kernel) is a timeout during connection setup.

Construct a hostfile similar to:

hostname-a
hostname-b

where "hostname-a" is the name returned by hostname. Assuming that file is called "mpihosts", run the InfiniPath 1.3 mpirun command from one of the two systems (after being sure that you can ssh to both systems without a password prompt), as follows:

mpirun -np 2 -m mpihosts -q 75 -i 30 osu_latency

The -q and -i arguments limit the time the command will wait, in case the initial connection setup succeeds but later packets do not. The expected output is similar to this:

mpirun -np 2 -m ~/tmp/x osu_latency
# OSU MPI Latency Test (Version 2.0)
# Size          Latency (us)
0               2.05
1               2.05
2               2.06
4               2.05
8               2.06
16              2.23
32              2.30
64              2.46
128             2.65
256             2.95
512             3.55
1024            4.64
2048            6.65
4096            9.36
8192            13.76
16384           22.50
32768           40.13
65536           88.01
131072          164.70
262144          317.92
524288          598.73
1048576         1152.32
2097152         2295.16
4194304         4503.84

To see bandwidth rather than latency, replace osu_latency with osu_bw.
With the workaround in place, osu_bw shows results like this:

# OSU MPI Bandwidth Test (Version 2.0)
# Size          Bandwidth (MB/s)
1               0.651732
2               1.302800
4               2.606929
8               5.217531
16              8.292083
32              14.647343
64              23.436469
128             33.682716
256             43.019580
512             50.037300
1024            54.353774
2048            56.979434
4096            56.992164
8192            57.025875
16384           57.117413
32768           57.185646
65536           921.954681
131072          946.896986
262144          950.289456
524288          952.420350
1048576         953.194099
2097152         953.575631
4194304         953.762674

Whereas it should show results similar to this:

# OSU MPI Bandwidth Test (Version 2.0)
# Size          Bandwidth (MB/s)
1               2.163581
2               4.325767
4               8.695674
8               17.378681
16              30.907940
32              61.744786
64              122.494611
128             238.692173
256             426.670041
512             695.119125
1024            851.463330
2048            913.777262
4096            929.623021
8192            942.032119
16384           948.161369
32768           951.366090
65536           912.211750
131072          936.202759
262144          944.793563
524288          947.386518
1048576         950.596909
2097152         952.261498
4194304         953.034304
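For reference, here is a minimal sketch of the mask setup used in the workaround, written as a generic 2.6.9-era PCI probe fragment rather than the actual InfiniPath probe code; the function name and error handling are made up for illustration only:

#include <linux/pci.h>
#include <linux/dma-mapping.h>

static int example_setup_dma_masks(struct pci_dev *pdev)
{
	/* Preferred: let the chip DMA anywhere in 64-bit space. */
	if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK) &&
	    !pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK))
		return 0;

	/*
	 * Workaround described above: restrict both masks to 32 bits so
	 * that buffers above 4GB are remapped through the IOMMU aperture.
	 * This is what causes the large latency/bandwidth hit reported here.
	 */
	if (pci_set_dma_mask(pdev, DMA_32BIT_MASK) ||
	    pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK))
		return -EIO;
	return 0;
}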
<sigh> It is what I feared:

[root@dhcp7-149 ~]# mpirun -np 2 -m hostfile -q 75 -i 30 osu_latency
MPI_runscript-pesc430-02.rhts.boston.redhat.com.0: ssh -x> Cannot detect InfiniPath interconnect.
MPI_runscript-pesc430-02.rhts.boston.redhat.com.0: ssh -x> Seek help on loading InfiniPath interconnect driver.
MPIRUN: Node program(s) exitted during connection setup
[root@dhcp7-149 ~]#

We received four adapters - two HTX and two PCIe. One of the PCIe adapters we sent to Doug Ledford (Raleigh). The IBM machine that had the HTX slot was sent back for motherboard replacement and has not come back, so the HTX cards can't be used unless I find some AMD box that has those slots. What we have working is one PathScale PCIe adapter and a couple of MT25208 InfiniHostEx Mellanox Technologies adapters. They all talk nicely together, but that is not helping us here. Is there a way I can modify osu_latency to work with the other adapters?
Or would getting osu_latency to work with the other adapters not even matter, since that would invalidate what we are trying to reproduce?
FYI: We are now running the 2.6.9-41.ELsmp kernel.
This sounds to me like "/etc/init.d/infinipath restart" either wasn't done, or failed in some way. This code will work with both the HTX and PCIe adapters. You can run ipathbug-helper and email it to me or attach it to the bug, but I'm betting that the driver didn't load for some reason, so the first set of things to do is:

lspci -n | grep -i 1fc1          # check that the adapter is present and seen on the bus
/etc/init.d/infinipath restart   # load the driver, etc.
lsmod | grep ipath_
dmesg | grep ipath
grep ipath /proc/interrupts
ipath_control -i

If all goes well, and at standard debugging levels, there won't be any output from the dmesg command. We should have an interrupt handler registered, and > 0 interrupts. ipath_control -i should show that the link is up and a LID assigned, something similar to this:

$Id: PathScale Release1.3 $ $Date: 2006-07-11-14:26 $
00: Version: Driver 2.0, InfiniPath_HT-460, InfiniPath1 3.2, PCI 2, SW Compat 2
00: Status: 0xf1 Initted SMA Present IB_link_up IB_configured
00: LID=0x68 MLID=0xc002 GUID=00:11:75:00:00:06:e0:72 Serial: 1286040114

(That's for our HyperTransport card, without using OpenIB. The presence of IB_link_up is critical for correct operation; it may sometimes take 15-60 seconds for the SM and SMA to negotiate the link up.)
Sorry, I guess I didn't completely answer the question about other adapters. No, our InfiniPath software won't work on Mellanox cards. You can run on a mixture of HT and PCIe cards, but that sounds like it might not help. I'm assuming the error message you showed was from the system with a PathScale adapter, right? MVAPICH over OpenIB would interoperate, but it likely won't show the same problems, because it uses quite different code paths and setup. If necessary, we can make systems available to you from our lab that have RHEL4 UP4 installed, have our adapters installed, and have serial consoles accessible over the net. It will take a few hours to set that up so it's in our externally accessible DMZ, though. Robert or I can also run any debug kernels, setup, etc. that you would like reports back on, if that's faster. We can also FedEx another PCIe adapter to you, if you give us the shipping info and if that's faster than getting it back from Doug.
Let me dig around. I want to do a bit of code comparison and see what differences there are in the kernels from 2.6.9 to 2.6.16 in the affected code paths. That will take a bit of time. Could you do a few things, please:

1) Ship the PCIe adapter to me. I will send you an e-mail with my address.
2) Try different mainline kernels on your test boxes, starting from 2.6.10 up to 2.6.15, to see if the fix is in one of those kernels.
3) If I have a possible fix today or a debug kernel, we can work on getting the serial console set up externally to track it further down. Let's wait on this, since I first have to digest the code.

Thanks.
We've not tried all the mainline kernels, and doing so will involve substantial effort to get the driver to build, based on our backporting experience. We know the problem is not in fc3 2.6.12; fc4 2.6.12, 14, 15, or 16; nor in SuSE 9.3 2.6.11, SuSE 10 2.6.13, or SLES10 2.6.16; nor is it in the kernel.org 2.6.16, 2.6.17, or 2.6.18 kernels. An earlier version of the driver was also tested on fc3 2.6.11, and that didn't have the problem either. Rather than taking the effort to port and validate against a long series of mainline kernels, I'd prefer to concentrate on the ones that you think have the highest likelihood of giving a clue as to where the problem might lie. Given the list of kernels above, and your knowledge of VM and/or DMA-related changes, which kernels should we concentrate on to begin with? We'll send the PCIe adapter as soon as we get the shipping info.
Dave, thanks for giving me all that kernel release information. It helps tremendously. Would it also be possible to attach the dmesg output? Thanks!
In the "normal" case, 'dmesg|grep ipath_' won't show any output at all, because everything is OK. So any output you see potentially, at least, indicates a problem. Typical messages of interest would be things like those in the list below. mtrr_add(feb00000,0x100000,WC,0) failed (-22) infinipath: probe of 0000:04:01.0 failed with error -22 Couldn't setup irq handler, irq=%u: %d pci_enable_msi failed: %d, interrupts may not work Write combining not enabled (err %d): performance may be poor Failed to allocate skbuff, length %u No interrupts enabled, couldn't setup interrupt address Fatal Error (freeze mode), no longer usable
Dave, I was thinking of the full dmesg. I am interested in seeing if SWIOTLB is enabled on your machine or if IOMMU is disabled.
The iommu definitely isn't disabled, and we shouldn't be using the swiotlb, because we have the hardware iommu. Here's the relevant part of the dmesg output on the 8GB UP4 system:

dmesg | egrep -i iommu\|tlb
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
Total HugeTLB memory allocated, 0

I've attached the full dmesg output, with the driver forcing a 32-bit DMA mask. By the way, we shipped 2 PCIe InfiniPath cards to Jeff Burke's home, so you should have them this weekend.
Created attachment 132473 [details] dmesg output from 8GB opteron with UP4 and infinipath
Comment #7 mentions that the driver worked under 2.6.11 (fc3). Is that the 1.3 version of the driver or the 1.1 version that has been demonstrated to work properly?
Comment #1 mentions "DMA does not show up in the correct location in memory", which puzzles me. I have gone over the dma_alloc_coherent code in 2.6.16, 2.6.11 and 2.6.9. When setting up the page with DMA mask set to 64-bit, the driver does not complain. I ran it with debug options and it happily loaded. No error messages at all. Or is it b/c the driver does not utilize this memory segment until an user application does something? If that is the case, are there any tools or code paths in the InfiniPath driver that I can call to utilize the 128 bytes. Are there some magic strings in there so that I can be sure I am getting the right data? FYI: I am running a 2.6.9-41 kernel with Jason's latest fix to PROT_READ (which fixes BZ 197979 I gather).
I believe it was a much older version of the driver, somewhere between our 1.1 and 1.2 InfiniPath releases, that was tested with large memory on fc3 2.6.11 kernels. With respect to Konrad's comment #14, the driver will not complain. It's the user programs that don't get the data where it's supposed to be, so you have to run an InfiniPath MPI job to see the problem; that's the mpirun command as in comment #1, for example. These are the memory areas allocated with dma_alloc_coherent() that are then mapped to user addresses, so the user can see the data being DMA'ed by the InfiniPath chip.
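To make that data path concrete, here is a rough, self-contained sketch of the pattern being described; it is not the actual ipath code (example_dev, example_alloc, and example_mmap are made-up names), and it uses the newer-kernel remap_pfn_range() path rather than the 2.6.9 compatibility code discussed later in this bug. The driver allocates a coherent DMA buffer, programs the chip with the bus address, and later maps the same pages into a user process via the mmap() file operation:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

struct example_dev {
	struct device *dev;
	void *cpu_addr;		/* kernel virtual address of the buffer */
	dma_addr_t bus_addr;	/* address programmed into the chip */
	size_t len;
};

static int example_alloc(struct example_dev *ed)
{
	/* Coherent buffer the chip DMAs into and the user later reads. */
	ed->cpu_addr = dma_alloc_coherent(ed->dev, ed->len,
					  &ed->bus_addr, GFP_KERNEL);
	return ed->cpu_addr ? 0 : -ENOMEM;
}

static int example_mmap(struct example_dev *ed, struct vm_area_struct *vma)
{
	/*
	 * The pfn must be computed and shifted in 64-bit arithmetic;
	 * truncating it to 32 bits is exactly the failure mode discussed
	 * in the later comments on this bug.
	 */
	unsigned long pfn = (unsigned long)ed->bus_addr >> PAGE_SHIFT;

	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}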
Just in case, I'm attaching a patch to allow the infinipath 1.3 version of our driver to compile against the RHEL4 UP4 -39 and later kernels. The infinipath driver in drivers/infiniband/hw/ipath in the kernel sources for -40 and later can also be used.
Created attachment 132488 [details] patch so pathscale infinipath 1.3 driver builds on RHEL4 UP 4 2.6.9-39 and later
This might be part of the problem. The 'pfn' is declared as a 32-bit 'unsigned', but the 64-bit DMA address that is bit-shifted gets stored into it, so the result is truncated to 32 bits instead of being kept in a 64-bit value.

--- linux-2.6.9.orig/drivers/infiniband/hw/ipath/ipath_file_ops.c	2006-07-15 22:26:57.000000000 -0400
+++ linux-2.6.9/drivers/infiniband/hw/ipath/ipath_file_ops.c	2006-07-15 22:27:55.000000000 -0400
@@ -921,7 +921,7 @@
 		     int write_ok, dma_addr_t addr, char *what)
 {
 	struct ipath_devdata *dd = pd->port_dd;
-	unsigned pfn = (unsigned long)addr >> PAGE_SHIFT;
+	unsigned long pfn = (unsigned long)addr >> PAGE_SHIFT;
 	int ret;
 
 	if ((vma->vm_end - vma->vm_start) > len) {

I have not yet run the tests, so I cannot with 100% guarantee say this fixes the problem.
Since the pfn is a page number, this would only be an issue if the DMA address was greater than 44 bits. Since the Opteron (and even the latest Intel EM64T processors) have a 40-bit I/O and physical memory address limit, that seems extremely unlikely. However, since 2.6.9 doesn't have remap_pfn_range() (which we use on newer kernels), the pfn gets shifted left by PAGE_SHIFT again by the compatibility code in our driver (just above the function for which you show the diff) and passed to remap_page_range(), which takes a long. If the compiler is truncating at that point, rather than keeping all 64 bits (which should happen with a function that takes a long, with its prototype in scope), that could well be the problem.

I made the change and tried it, and in early testing it seems to fix the problem. Actually, I made the change to the compatibility macro right above, even though the other users already have a long, just to defend against future use. I also ran with driver debug printing out the physical addresses, to be sure that addresses handled by this code were in fact above 4GB. There weren't many, apparently the VM tries really hard to allocate DMA addresses below 4GB, even when running a memory hog, but there were some. I'll run a more extensive set of tests.

Either Konrad's or my patch should also get made to the ipath_file_ops.c that is part of the 2.6.9-41 kernel. Konrad or Jeff, can one of you work with Jason to get that to happen, or should one of us at QLogic work with him on that?

Here's the patch that I did (I made the cast unsigned long rather than long to be super paranoid about sign extension, although that "can't happen" on most, if not all, current architectures):

sh-3.00# diff -u ipath_file_ops.c-orig ipath_file_ops.c
--- ipath_file_ops.c-orig	2006-07-16 08:24:57.000000000 -0700
+++ ipath_file_ops.c	2006-07-16 08:40:02.000000000 -0700
@@ -907,10 +907,10 @@
 
 #ifndef io_remap_pfn_range
 #define io_remap_pfn_range(vma, addr, pfn, size, prot) \
-	io_remap_page_range((vma), (addr), (pfn) << PAGE_SHIFT, (size), \
+	io_remap_page_range((vma), (addr), ((unsigned long)pfn) << PAGE_SHIFT, (size), \
 		(prot))
 #define remap_pfn_range(vma, addr, pfn, size, prot) \
-	remap_page_range((vma), (addr), (pfn) << PAGE_SHIFT, (size), \
+	remap_page_range((vma), (addr), ((unsigned long)pfn) << PAGE_SHIFT, (size), \
 		(prot))
 #endif

This code is used on kernels other than 2.6.9 (as I recall, everything prior to the 2.6.16 kernel.org kernels), so it would seem that it's probably an issue only with the gcc that's part of RHEL4, or perhaps that gcc plus the kernel compile options. In any case, thanks for tracking this down, Konrad!

I tried this same change on the RHEL4 UP3 2.6.9-34 kernel, with the standard infinipath 1.3 release driver, and it seems to fix the problem there as well, so we seem to have a change that works on both UP3 and UP4. (I verified a >4GB address on UP3 as well, of course.)
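A tiny stand-alone C program (an illustration only; it assumes an LP64 x86_64 build and a PAGE_SHIFT of 12, and the address value is made up) shows why widening before the shift matters: if the shift is done in 32-bit arithmetic, the high bits of an address above 4GB are gone before the value is ever converted to the unsigned long that remap_page_range() takes.

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned long addr = 0x12f8d4000UL;	/* example physical address above 4GB */
	unsigned pfn32 = addr >> PAGE_SHIFT;	/* pfn itself still fits in 32 bits */
	unsigned long pfn64 = addr >> PAGE_SHIFT;

	/* Shift done in 32-bit arithmetic, then widened: high bits lost. */
	printf("truncated:  %#lx\n", (unsigned long)(pfn32 << PAGE_SHIFT));
	/* Widened before the shift, as in the patches above: correct. */
	printf("widened:    %#lx\n", ((unsigned long)pfn32) << PAGE_SHIFT);
	/* pfn kept in an unsigned long from the start: also correct. */
	printf("64-bit pfn: %#lx\n", pfn64 << PAGE_SHIFT);
	return 0;
}

The first line prints 0x2f8d4000 while the other two print 0x12f8d4000, which is exactly the kind of truncated address that would land the DMA data in the wrong place.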
Created attachment 132526 [details] Patch to 2.6.9-41 tree Patch posted to internal reflector for exception + possible inclusion in RHEL4.
Dave, I have tested here with Konrad's patch. It is working as expected. I have not tested with the patch (QLogic's) from Comment #19. Konrad has posted his patch to the internal mailing list. This now needs to go through the internal Red Hat exception process. Kernel developers will review the patch. PM, QE, Linda W, and Peter M will discuss the exception at the next meeting; it will be some time on Monday, I am sure. Any additional testing data that you have will only help in potentially getting this patch added. Dave O, I just want to verify that you did run with Konrad's patch, correct? I have attached your patch to the mail Konrad sent out to the mailing list as well. Developers can choose which approach they wish to ack. Jeff
Yes, I tested with Konrad's patch, as well as the one I noted. They should be functionally the same. I'd be very happy if Konrad's patch makes it into the final UP4 release kernel. The more extensive testing is being done with my version of the patch, since that's what we'll want to use going forward in our driver (in case we add more calls to remap_pfn_range in the future). Those tests, running since my last comment, are all passing. Thanks again to everybody who has been working on this at Red Hat. I'm still somewhat curious as to what exactly is causing the issue that requires the cast, but that's a distant second to fixing the problem.
My fairly extensive MPI QA suite tests have been running for nearly 24 hours now, and no problems have been seen so far, so I think Konrad has definitely got the right answer.
committed in stream U5 build 42.2. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
QE ack for 4.5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html