Bug 197979

Summary: QLogic InfiniPath problem with >= 4GB memory, and mmap64 of dma_alloc_coherent() memory
Product: Red Hat Enterprise Linux 4
Reporter: Dave Olson <dave.olson>
Component: kernel
Assignee: Doug Ledford <dledford>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: bugproxy, bwthomas, coughlan, gozen, jbaron, jburke, konradr, lcm, rjwalsh
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-07-17 18:10:39 UTC

Description Dave Olson 2006-07-07 19:56:10 UTC
Description of problem:

The InfiniPath 1.3 Infiniband driver allocates a single page of memory with
dma_alloc_coherent().   This memory is used for two purposes.   The first
portion is used as a DMA target by the InfiniPath chip.  The second portion
(offset 128 bytes) is used to share status information between the driver
and InfiniPath user-mode programs.

The driver is available from the www.pathscale.com website after registration,
and substantially the same driver is part of the OpenFabrics (OpenIB) 1.0
distribution; it is also now in the 2.6.18 kernel.org kernel.  The driver
source is in drivers/infiniband/hw/ipath/*

The memory is mapped by user programs using mmap64 (mmap64 is used because
the addresses can be more than 32 bits, and mmap(), even with
-D_LARGEFILE64_SOURCE, truncates the address on some glibc and kernel
versions).  The problem is seen by both 32-bit and 64-bit programs.

On RHEL4, both update2 and update3 (2.6.9-22 and 2.6.9-34 kernels), data written
into this memory by both the hardware and the device driver is not seen by
the user program, even though all addresses appear to be correct and no
errors are reported at any point.  The problem only occurs if the system
has 4GB or more of physical memory installed.

This problem is seen on multiple opteron motherboards, multiple BIOS versions,
and also on Intel Woodcrest systems.

The problem is not seen with Fedora Core 4 2.6.14, 2.6.15, or 2.6.16
distributions and kernels; it is not seen with SLES10 2.6.16 kernels;
and it is not seen with kernel.org 2.6.16 or 2.6.17 kernels running on
a RHEL4 UP2 or UP3 base installation.

We have found a workaround for BIOSes that have the option to disable
system memory remapping around the I/O address hole that is normally
located just below 4GB.  When that is disabled, no problems are observed
even on the RHEL4 kernels.  Unfortunately, not all BIOSes offer this option,
and even when it is available, it can result in up to 1GB of system memory
being made unavailable, which is unacceptable to some customers.

The kernels are the standard installed kernels, they have not been reconfigured
or rebuilt.

The memory allocation in the driver is done in ipath_init_chip.c as:

 dd->ipath_pioavailregs_dma = dma_alloc_coherent(
        &dd->pcidev->dev, PAGE_SIZE, &dd->ipath_pioavailregs_phys,
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,10)
        __GFP_REPEAT |
#endif
        GFP_KERNEL);
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,15)
    SetPageReserved(virt_to_page(dd->ipath_pioavailregs_dma));
#endif
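
For reference, dma_alloc_coherent() hands back two addresses: the return
value is the kernel virtual address that the driver (and, via mmap, user
space) reads, while the dma_addr_t it fills in is the bus address that gets
programmed into the chip.  A minimal sketch of that pattern, with
illustrative names rather than the actual ipath code:

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/pci.h>

static void *shared_page;          /* kernel virtual address, used by the driver */
static dma_addr_t shared_page_bus; /* bus/DMA address, programmed into the chip */

static int alloc_shared_page(struct pci_dev *pdev)
{
        /* One coherent page: the first portion is a chip DMA target, the
         * rest holds the driver/user status words described above. */
        shared_page = dma_alloc_coherent(&pdev->dev, PAGE_SIZE,
                                         &shared_page_bus, GFP_KERNEL);
        if (!shared_page)
                return -ENOMEM;
        return 0;
}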

The status shared information is setup as:
        dd->ipath_statusp = (u64 *)
        ((char *)dd->ipath_pioavailregs_dma +
         ((2 * L1_CACHE_BYTES +
           dd->ipath_pioavregs * sizeof(u64)) & ~L1_CACHE_BYTES));

The mmap64 call is done with these arguments:

mmap64(0, PAGE_SIZE, PROT_READ, MAP_SHARED,
                fd, (__off64_t)b->spi_pioavailaddr);

where spi_pioavailaddr is returned by the driver during an initialization
call in ipath_file_ops.c.
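
As a self-contained user-space illustration of that mapping (a sketch only:
the device path, the example offset, and the variable names here are made up;
the real offset is the spi_pioavailaddr value handed back by the driver at
initialization time):

#define _LARGEFILE64_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/ipath", O_RDWR);      /* illustrative device node */
        uint64_t pioavailaddr = 0x100000000ULL;   /* example offset above 4GB */
        long pagesize = sysconf(_SC_PAGESIZE);
        void *p;

        if (fd < 0)
                return 1;

        /* mmap64 so a >32-bit address survives the offset argument intact */
        p = mmap64(NULL, pagesize, PROT_READ, MAP_SHARED,
                   fd, (off64_t)pioavailaddr);
        if (p == MAP_FAILED) {
                perror("mmap64");
                return 1;
        }
        /* ... read the shared status words through p ... */
        munmap(p, pagesize);
        close(fd);
        return 0;
}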

The mmap handler is also in ipath_file_ops.c, in ipath_mmap(), with a call
to ipath_mmap_mem(), which uses

  remap_page_range(vma, vma->vm_start, pfn<<PAGE_SHIFT, len,
      vma->vm_page_prot);
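
For context, a bare-bones 2.6.9-era handler following that pattern might look
like this (names are illustrative; this is a sketch of the call, not the
actual ipath_mmap_mem()):

#include <linux/errno.h>
#include <linux/mm.h>

/* Map one coherent page into the calling process; 'phys' is the physical
 * address that user space passed in as the mmap offset. */
static int example_mmap_mem(struct vm_area_struct *vma, unsigned long phys,
                            unsigned long len)
{
        unsigned long pfn = phys >> PAGE_SHIFT;

        if ((vma->vm_end - vma->vm_start) > len)
                return -EINVAL;

        /* 2.6.9 uses remap_page_range() with the physical address; newer
         * kernels replace it with remap_pfn_range() taking the pfn. */
        return remap_page_range(vma, vma->vm_start, pfn << PAGE_SHIFT, len,
                                vma->vm_page_prot);
}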

Comment 1 Tom Coughlan 2006-07-10 18:05:46 UTC
If you would like to try a test with a preliminary version of the RHEL 4 U4
kernel, you can get one here:

http://people.redhat.com/jbaron/rhel4/

I do not know of a fix in this area, but it would help to try the latest. 

I'm passing this along to our Infiniband guy. 

Comment 2 IBM Bug Proxy 2006-07-12 19:06:29 UTC
----- Additional Comments From luvella.com  2006-07-12 15:13 EDT -------
From Dave Olson:
OK.  We made the changes to compile with the RHEL4 UP4 Beta kernel, and UP4
unfortunately shows exactly the same behavior as UP3 in this area.

We'll try a full update to UP4, just in case library changes have an
effect, but I don't expect it to help.

I'll add this info to the bug report, as well, once we've tested with
the full UP4 beta distro update. 

Comment 3 Jeff Burke 2006-07-12 20:06:04 UTC
Glen,
    Just to make sure we are discussing the same kernel, can you please
provide a kernel version, e.g. 2.6.9-40.EL.

Comment 4 Konrad Rzeszutek 2006-07-12 20:56:17 UTC
What kind of adapters are they? PCIe or the HTX ones? 

Comment 5 Dave Olson 2006-07-12 21:40:45 UTC
We are currently using the HTX adapters, although the problems show up
with both PCIe and HTX.  We are using the RHEL4 UP4 beta that has the
2.6.9-39 kernel, uname -a reports:
  Linux iqa-13 2.6.9-39.ELsmp #1 SMP Thu Jun 1 18:01:55 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux

We've tried both the AS kernel and the WS kernel; the system is installed
with the WS distribution, at the moment.

I've now repeated the testing with the system fully updated to RHEL4 UP4 Beta,
and we still see the problem (as expected, since it's almost certainly a
kernel issue).

We are using the shipped kernel, and therefore the standard .config.

I've also tried this with kernel.exec-shield-randomize=0, as well as the
default value of 1 (we normally set that to 0 on RHEL4 systems to work
around some occasional issues with shared memory setup).  The behavior is
the same, for both settings.

All of this is with no special kernel boot command-line options.  I also
tried it with selinux=0, which we often use, both with and without
randomize, and that seems to avoid this particular problem (I have not
gone back to the RHEL4 UP3 kernel to see whether selinux=0 helps there or not).

Unfortunately, it then exposes the next problem (which I was afraid might
be the case), which is that our packet DMA does not show up at the physical
memory location where we expect it.  We had also seen this problem in the
past, but it was difficult to get past the first problem to observe it.

I can open a new bug on that issue, if desired.   In brief, if
dma_alloc_coherent() returns a physical address above 4GB, then
the data being DMA'ed to that location does not show up at that
virtual address.   That memory is also being mmap'ed with mmap64 by
the user program.



Comment 6 Dave Olson 2006-07-12 21:51:52 UTC
By the way, our chip has 64bit DMA addressing capability, and we therefore
use the 64 bit DMA mask:  pci_set_dma_mask(pdev, DMA_64BIT_MASK), so
the iommu should not be getting used.

I'll try changing it to only set the 32 bit DMA mask, to see if that
changes the behavior by forcing the use of the iommu.
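
For reference, the usual pattern here (a sketch of the approach, not the
exact ipath_driver.c code) is to try the 64-bit masks first and only fall
back to 32-bit, which is what routes DMA through the IOMMU:

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/pci.h>

static int setup_dma_masks(struct pci_dev *pdev)
{
        /* Prefer 64-bit DMA so the chip addresses memory directly and the
         * IOMMU stays out of the path. */
        if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK) &&
            !pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK))
                return 0;

        /* Otherwise fall back to 32-bit, which forces the IOMMU into use
         * (the workaround being tested in this bug). */
        if (pci_set_dma_mask(pdev, DMA_32BIT_MASK) ||
            pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK))
                return -ENODEV;

        return 0;
}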

Comment 7 Robert Walsh 2006-07-12 22:40:57 UTC
Just a quick note: I'm about to try out the 2.6.9-41 kernel's shipping
InfiniPath driver (i.e. the one from OFED-1.0) to see if that exhibits the same
problem.  I'll update the bug as soon as I have an answer.

Comment 8 Robert Walsh 2006-07-13 00:57:51 UTC
Unfortunately, it looks like the fix for bug 194289 (which I can't access, for
some reason) basically means our device driver won't work at all in the 2.6.9-41
kernel.  As a quick background: the 2.6.9-41 kernel now implies PROT_READ if
PROT_WRITE is specified to mmap.  Our driver checks that device write-only
memory is not mmap'd as readable and refuses to mmap it if it is.  When I manage
to get past this hurdle, I'll update the bug.

Comment 9 Dave Olson 2006-07-13 03:28:40 UTC
I have discovered that forcing our driver to use only a 32-bit DMA mask
(pci_set_dma_mask(pdev, DMA_32BIT_MASK) and
pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK), rather than the 64BIT
versions) allows our InfiniPath driver to work on the UP4 system, with no
changes to the kernel, kernel configuration, or kernel command-line boot
parameters.


On UP3 (the 2.6.9-34.ELsmp kernel), this change does not help (which matches
our memory of our earlier testing on this issue).  So UP4 at least
offers us an opportunity for a workaround.

On the UP4 2.6.9-39.ELsmp kernel, at least, we have performance problems
with this workaround.

It seems to have a significant performance impact on InfiniPath MPI
operations over InfiniBand, adding about 0.7 to 1.0 usec of latency for
0-byte packets and significantly limiting bandwidth for the smaller packets
where data is copied from one buffer to another, when the system with the
workaround is receiving data (but not when it is sending).

Additionally, that workaround seems to have a quite significant impact on
shared memory latency and bandwidth with the standard MPI benchmarks.
I don't yet understand why this should be the case, but it is quite
major (10.0 usec latency for 0 byte packets vs 0.8 usec without the
change).  Similarly shared memory MPI peak bandwidth drops significantly.

The shared memory issue shows up only on some runs, and seems somewhat
sticky across reboots, so it may depend on where in physical memory the
(fairly large) shared memory segment is allocated.  The InfiniBand latency
and bandwidth problems show up on pretty much every run.  Perhaps we are
having to take faults for some of the IOMMU handling, although I didn't
think it worked that way for accesses from the processor?


I'll continue to research the slowdown issue.  I can't see why the use of
the IOMMU (which is forced by using the 32-bit DMA mask) should have such
a large effect on memory copies, but it certainly seems quite dramatic.
Regular memory throughput benchmarks don't seem to be affected.  About
all I can think of is that we are taking some huge number of faults.

The infinipath patch for this (diff -u) is as follows:
--- ipath_driver.c-orig 2006-07-12 17:17:10.000000000 -0700
+++ ipath_driver.c      2006-07-12 17:17:26.000000000 -0700
@@ -377,6 +377,8 @@
        }
 
        ret = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
+       dev_info(&pdev->dev, "Forcing use of 32 bit DMA\n"); ret = 1; // OLSON OLSON
+
        if (ret) {
                /*
                 * if the 64 bit setup fails, try 32 bit.  Some systems



Comment 10 Jeff Burke 2006-07-13 21:20:17 UTC
Dave Olson:
   Did you or can you open a new issue for the problem described in Comment #5:
"I can open a new bug on that issue, if desired. In brief, if
dma_alloc_coherent() returns a physical address above 4GB, then
the data being DMA'ed to that location does not show up at that
virtual address.  That memory is also being mmap'ed with mmap64 by
the user program."

Robert Walsh:
   Can you please elaborate on what you mean by "won't work at all" in Comment
#8? Will the driver fail to load? 

   Konrad Rzeszutek and I are actively trying to reproduce this issue. We seem to
be stumbling over several issues.
 1) We only have one Pathscale card. Currently we have an open bug that says
"Mellanox Technologies MT25208" and Pathscale cards are having an
interoperability issue. (I am not sure if this is still true, but this is the
information Gurhan just told me on the phone.)

 2) The description of the problem is very vague about the hardware that the
Pathscale card uses: "This problem is seen on multiple opteron motherboards,
multiple BIOS versions, and also on Intel Woodcrest systems." So we are not sure
if the hardware we are trying to reproduce this with will even show this issue.

  We have tested it on a Dell PowerEdge 830 and also an AMD sample Sahara. We
do not have IBM boxes with HTX slots. We are going to drop back to the Intel
Woodcrest system, but we need to get that from the Performance Engineering Group.


Comment 11 Robert Walsh 2006-07-13 21:45:13 UTC
Here's some clarification on Comment #8.  The driver loads just fine.  Our user
space MPI library then attempts to mmap some chip memory PROT_WRITE (no
PROT_READ), and our driver gets confused because the mmap code in the kernel
adds PROT_READ into the flags (that's what the fix for bug 194289 did.)  We have
an explicit check in our driver to make sure that PROT_READ isn't set, since the
chip memory that's being mmap'd is not readable, only writable.
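
A hypothetical illustration of that driver-side check (not the actual
ipath_file_ops.c code):

#include <linux/errno.h>
#include <linux/mm.h>

/* The chip's PIO send buffers are write-only from the CPU, so refuse any
 * mapping the kernel has marked readable.  A kernel that silently adds
 * PROT_READ to PROT_WRITE mappings will trip this check. */
static int check_writeonly_mapping(struct vm_area_struct *vma)
{
        if (vma->vm_flags & VM_READ)
                return -EPERM;
        return 0;
}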

The good news is that Jason came up with a different fix for bug 194289 that
resolves this issue.  I was then able to verify that the driver in the kernel
hit the same problem that we're seeing with our out-of-kernel build of the driver.

Comment 12 Konrad Rzeszutek 2006-07-13 22:26:35 UTC
After a bit of trying different machines (and inadvertently causing a power supply
to spit out magic smoke), we got the Pathscale PCIe and Mellanox cards talking to
each other. The Pathscale PCIe card is in an Intel Woodcrest sample box with 4GB of RAM.

What kind of tests/source code should we run to reproduce this? And how do we
detect that the test does reproduce this problem? The initial description states
that "no errors are reported at any point".

Thanks.

Comment 13 Konrad Rzeszutek 2006-07-13 22:36:44 UTC
FYI: The kernel we have running is 2.6.9-40.ELsmp on both machines.

Comment 14 Dave Olson 2006-07-14 01:20:46 UTC
I have opened a new bug on the problem, as requested.  It is bug 198847.
That bug has details on how to run the programs to see the problem, but
requires registering for and downloading the infinipath 1.3 release from
the pathscale.com website.  We can send it to you as a tarball attachment
if registering on our site is a problem.

To run infinipath MPI as mentioned in that report, you'll need to use
the Pathscale 1.3 infinipath release driver, or the UP4 kernel 2.6.9-41
available from Jason, as 2.6.9-40 has an mmap behavior change that breaks
our user code; 2.6.9-41 has a different fix for the bug that works with
infinipath.

Comment 15 Konrad Rzeszutek 2006-07-14 12:40:05 UTC
Dave,

Please send the tarball or just provide the exact URL to your company website. I
don't know exactly where to register to get the driver :-(

Comment 16 Konrad Rzeszutek 2006-07-14 15:02:49 UTC
Dave,
Ignore my previous comment pls. Google helped me find them.

Comment 17 Konrad Rzeszutek 2006-07-14 15:47:57 UTC
FYI: We are running now 2.6.9-41.ELsmp kernel.

Comment 18 Betsy Zeller 2006-07-14 23:45:16 UTC
Konrad - we have an additional board packed up and ready to send as soon as we
can get your shipping address. Have you had any luck so far reproducing the problem?

Comment 19 Jeff Burke 2006-07-15 00:14:59 UTC
Betsy,
 I have sent you an email directly that has my home address. If you can send the
cards overnight for a Saturday delivery, that would be great.

 I just got off the phone with Konrad and when the cards arrive he will meet me
in the office.

Jeff

Comment 21 Dave Olson 2006-07-17 18:06:23 UTC
I'm going to suggest closing this bug (I don't have the permissions to do it
myself).

This particular problem never occurred in RHEL4 UP4, although that seems to be
somewhat by chance.  The fix (to the infinipath driver) was identified by
konradr in bug 198847.  That change fixes this problem as well as the secondary
problem in UP3, and a request has been made to apply the fix to the infinipath
driver in UP4.

Comment 22 Jason Baron 2006-07-17 18:10:39 UTC
fine with me :)