Bug 197979
| Summary: | QLogic InfiniPath problem with >= 4GB memory, and mmap64 of dma_alloc_coherent() memory | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Dave Olson <dave.olson> |
| Component: | kernel | Assignee: | Doug Ledford <dledford> |
| Status: | CLOSED NOTABUG | QA Contact: | Brian Brock <bbrock> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.0 | CC: | bugproxy, bwthomas, coughlan, gozen, jbaron, jburke, konradr, lcm, rjwalsh |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-07-17 18:10:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dave Olson
2006-07-07 19:56:10 UTC
If you would like to try a test with a preliminary version of the RHEL 4 U4 kernel, you can get one here: http://people.redhat.com/jbaron/rhel4/ I do not know of a fix in this area, but it would help to try the latest. I'm passing this along to our Infiniband guy.

------- Additional Comments From luvella.com 2006-07-12 15:13 EDT -------

From Dave Olson: OK. We made the changes to compile with the RHEL4 UP4 Beta kernel, and UP4 unfortunately shows exactly the same behavior as UP3 in this area. We'll try a full update to UP4, just in case library changes have an effect, but I don't expect it to help. I'll add this info to the bug report as well, once we've tested with the full UP4 beta distro update.

Glen, just to make sure we are discussing the same kernel, can you please provide a kernel version, e.g., 2.6.9-40.EL? What kind of adapters are they? PCIe or the HTX ones?

We are currently using the HTX adapters, although the problems show up with both PCIe and HTX. We are using the RHEL4 UP4 beta that has the 2.6.9-39 kernel; uname -a reports:

Linux iqa-13 2.6.9-39.ELsmp #1 SMP Thu Jun 1 18:01:55 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

We've tried both the AS kernel and the WS kernel; the system is installed with the WS distribution at the moment.

I've now repeated the testing with the system fully updated to RHEL4 UP4 Beta, and we still see the problem (as expected, since it's almost certainly a kernel issue). We are using the shipped kernel, and therefore the standard .config. I've also tried this with kernel.exec-shield-randomize=0, as well as the default value of 1 (we normally set that to 0 on RHEL4 systems to work around some occasional issues with shared memory setup). The behavior is the same for both settings. All of this is with no special kernel boot command line options.
I also tried it with selinux=0, which we often use, both with and without randomize, and that seems to avoid this particular problem (I have not gone back to the RHEL4 UP3 kernel to see if selinux=0 helps there or not). Unfortunately, it then exposes the next problem (which I was afraid might be the case): our packet DMA does not show up at the physical memory location where we expect it. We had also seen this problem in the past, but it was difficult to get past the first problem to observe it. I can open a new bug on that issue, if desired.

In brief, if dma_alloc_coherent() returns a physical address above 4GB, then the data being DMA'ed to that location does not show up at that virtual address. That memory is also being mmap'ed with mmap64 by the user program. By the way, our chip has 64-bit DMA addressing capability, and we therefore use the 64-bit DMA mask, pci_set_dma_mask(pdev, DMA_64BIT_MASK), so the iommu should not be getting used. I'll try changing it to only set the 32-bit DMA mask, to see if that changes the behavior by forcing the use of the iommu.

Just a quick note: I'm about to try out the 2.6.9-41 kernel's shipping InfiniPath driver (i.e. the one from OFED-1.0) to see if that exhibits the same problem. I'll update the bug as soon as I have an answer.

Unfortunately, it looks like the fix for bug 194289 (which I can't access, for some reason) basically means our device driver won't work at all with the 2.6.9-41 kernel. As quick background: the 2.6.9-41 kernel now implies PROT_READ if PROT_WRITE is specified to mmap. Our driver checks that device write-only memory is not mmap'd as readable and refuses to mmap it if it is. When I manage to get past this hurdle, I'll update the bug.
I have discovered that forcing our driver to use only a 32-bit DMA mask, pci_set_dma_mask(pdev, DMA_32BIT_MASK) and pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK), rather than the 64BIT versions, allows our InfiniPath driver to work on the UP4 system, with no changes to the kernel, kernel configuration, or kernel command line boot parameters. On UP3 (2.6.9-34.ELsmp kernel), this change does not help (which matches our memory of our earlier testing on this issue). So UP4 at least offers us an opportunity for a workaround.

On the UP4 2.6.9-39.ELsmp kernel, at least, we have performance problems with this workaround. It has a significant impact on InfiniPath MPI operations over InfiniBand, adding about 0.7 to 1.0 usec latency for 0-byte packets and limiting bandwidth quite significantly for the smaller packets where data is copied from one buffer to another, when the system with the workaround is receiving data (but not when it is sending).

Additionally, the workaround seems to have a quite significant impact on shared memory latency and bandwidth with the standard MPI benchmarks. I don't yet understand why this should be the case, but it is quite major (10.0 usec latency for 0-byte packets vs 0.8 usec without the change). Similarly, shared memory MPI peak bandwidth drops significantly. The shared memory issue shows up only on some runs, and seems somewhat sticky across reboots, so it may depend on where in physical memory the (fairly large) shared memory segment is allocated. The InfiniBand latency and bandwidth problems show up on pretty much every run. Perhaps we are having to take faults for some of the IOMMU stuff, although I didn't think it worked that way for accesses from the processor? I'll continue to research the slowdown issue. I can't see why use of the IOMMU (which is forced by the 32-bit DMA mask) should have such a large effect on memory copies, but the effect certainly seems quite dramatic.
Regular memory throughput benchmarks don't seem to be affected. About all that I can think of is that we are taking some huge number of faults. The infinipath patch for this (diff -u) is as follows:

```diff
--- ipath_driver.c-orig	2006-07-12 17:17:10.000000000 -0700
+++ ipath_driver.c	2006-07-12 17:17:26.000000000 -0700
@@ -377,6 +377,8 @@
 	}

 	ret = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
+	dev_info(&pdev->dev, "Forcing use of 32 bit DMA\n");
+	ret = 1; // OLSON OLSON
 	if (ret) {
 		/*
 		 * if the 64 bit setup fails, try 32 bit.  Some systems
```

Dave Olson: Did you or can you open a new issue for Comment #12: "I can open a new bug on that issue, if desired. In brief, if dma_alloc_coherent() returns a physical address above 4GB, then the data being DMA'ed to that location does not show up at that virtual address. That memory is also being mmap'ed with mmap64 by the user program."

Robert Walsh: Can you please elaborate on what you mean by "won't work at all" in Comment #8? Will the driver fail to load?

Konrad Rzeszutek and I are actively trying to reproduce this issue. We seem to be stumbling over several issues:

1) We only have one Pathscale card. Currently we have an open bug that says the "Mellanox Technologies MT25208" and Pathscale cards are having an interoperability issue. (I am not sure if this is still true, but this is the information Gurhan just told me on the phone.)

2) The description of the problem is very vague about the hardware the Pathscale card uses: "This problem is seen on multiple opteron motherboards, multiple BIOS versions, and also on Intel Woodcrest systems." So we are not sure whether the hardware we are trying to reproduce this with will even show this issue. We have tested it on a Dell PowerEdge 830, and also on an AMD sample Sahara. We do not have IBM boxes with HTX slots. We are going to drop back to the Intel Woodcrest system, but we need to get that from the Performance Engineering Group.

Here's some clarification on Comment #8. The driver loads just fine.
Our user space MPI library then attempts to mmap some chip memory PROT_WRITE (no PROT_READ), and our driver gets confused because the mmap code in the kernel adds PROT_READ into the flags (that's what the fix for bug 194289 did). We have an explicit check in our driver to make sure that PROT_READ isn't set, since the chip memory that's being mmap'd is not readable, only writable. The good news is that Jason came up with a different fix for bug 194289 that resolves this issue. I was then able to verify that the driver in the kernel hits the same problem that we're seeing with our out-of-kernel build of the driver.

After a bit of trying different machines (and inadvertently causing a power supply to spit out magic smoke), we got the Pathscale PCIe and Mellanox cards talking to each other. The Pathscale PCIe card is in an Intel Woodcrest sample box with 4GB of RAM. What kind of tests/source code should we run to reproduce this? And how do we detect that the test does reproduce this problem? The initial description states that "no errors are reported at any point". Thanks. FYI: The kernel we have running is 2.6.9-40.ELsmp on both machines.

I have opened a new bug on the problem, as requested. It is bug 198847. That bug has details on how to run the programs to see the problem, but requires registering for and downloading the InfiniPath 1.3 release from the pathscale.com website. We can send it to you as a tarball attachment if registering on our site is a problem. To run InfiniPath MPI as mentioned in that report, you'll need to use the Pathscale 1.3 InfiniPath release driver, or the UP4 kernel 2.6.9-41 available from Jason, as 2.6.9-40 has an mmap behavior change that breaks our user code; 2.6.9-41 has a different fix for the bug that works with InfiniPath.

Dave, please send the tarball or just provide the exact URL to your company website. I don't know exactly where to register to get the driver :-(

Dave, ignore my previous comment please. Google helped me find them.
FYI: We are now running the 2.6.9-41.ELsmp kernel.

Konrad - we have an additional board packed up and ready to send as soon as we can get your shipping address. Have you had any luck so far reproducing the problem?

Betsy, I have sent you an email directly that has my home address. If you can send the cards overnight for a Saturday delivery, that would be great. I just got off the phone with Konrad, and when the cards arrive he will meet me in the office.

Jeff, I'm going to suggest closing this bug (I don't have the permissions to do it myself). This particular problem never occurred in RHEL4 UP4, although that seems somewhat by chance. The fix (to the infinipath driver) was identified by konradr in bug 198847. That change fixes this problem as well as the secondary problem in UP3, and a request has been made to apply the fix to the infinipath driver in UP4.

Fine with me :)