Bug 198847 - QLogic InfiniPath problem with >= 4GB memory, DMA goes to incorrect address
Summary: QLogic InfiniPath problem with >= 4GB memory, DMA goes to incorrect address
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Konrad Rzeszutek
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 216986
 
Reported: 2006-07-14 01:17 UTC by Dave Olson
Modified: 2007-11-30 22:07 UTC (History)
CC List: 10 users

Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-05-08 02:44:18 UTC
Target Upstream Version:
Embargoed:


Attachments
dmesg output from 8GB opteron with UP4 and infinipath (13.55 KB, text/plain)
2006-07-15 01:25 UTC, Dave Olson
no flags
patch so pathscale infinipath 1.3 driver builds on RHEL4 UP 4 2.6.9-39 and later (6.21 KB, text/plain)
2006-07-15 13:46 UTC, Dave Olson
no flags
Patch to 2.6.9-41 tree (612 bytes, patch)
2006-07-16 17:24 UTC, Konrad Rzeszutek
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2007:0304 0 normal SHIPPED_LIVE Updated kernel packages available for Red Hat Enterprise Linux 4 Update 5 2007-04-28 18:58:50 UTC

Description Dave Olson 2006-07-14 01:17:10 UTC
Description of problem:
This is a follow-up to Red Hat bug 197979.

The InfiniPath PCIe and HyperTransport chips are capable of
doing 64-bit DMA.  On x86_64 systems with 4GB or more memory,
the DMA does not show up at the correct location in memory
(the virtual address returned by dma_alloc_coherent()).  We
are using the dma_addr_t returned by dma_alloc_coherent() to
program our chip.

If, on systems that support it, memory remapping around the
I/O address hole is disabled, everything works fine on both
4GB and 8GB systems.  Unfortunately, this workaround is not
acceptable to customers who have I/O devices with large address
ranges, since it can cause the loss of up to 1 GByte of memory.

If I change the driver to call pci_set_dma_mask(pdev, DMA_32BIT_MASK)
and pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK), then DMA goes
to the correct location.  This forces the IOMMU to be used.
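
For reference, a minimal sketch of that workaround as it might look in a
PCI driver's probe path (hypothetical code, not taken from the InfiniPath
driver; only the two mask calls come from this report):

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* Restrict both streaming and coherent DMA to the low 4GB so the
 * hardware IOMMU remaps every transfer above it; this avoids the
 * bug, at a significant performance cost. */
static int force_32bit_dma(struct pci_dev *pdev)
{
        if (pci_set_dma_mask(pdev, DMA_32BIT_MASK))
                return -EIO;
        if (pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK))
                return -EIO;
        return 0;
}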

This workaround causes very significant performance problems
with InfiniPath MPI, both across the InfiniBand fabric and with
shared memory, although the latter is intermittent and may depend
on where in physical memory the buffers are located.

Version-Release number of selected component (if applicable):
Occurs on RHEL4 UP3 (2.6.9-34) and on RHEL4 UP4 beta (2.6.9-39, -40,
and -41).

The performance problem shows up as an increase from the normal
1.4 usec latency (on a 2.0 GHz Opteron) to 10+ usec.  Bandwidth for
medium-size payloads is also greatly reduced, by 400 MByte/sec or more.
Obviously this isn't a good thing.

The problem occurs with the InfiniPath driver from our 1.3 release,
downloaded from the PathScale website, and also with the OFED 1.0
InfiniPath driver that is in the RHEL4 UP4 Beta2 2.6.9-40 and later
builds.

The InfiniPath MPI is needed to reproduce this; the MVAPICH-over-OpenIB
version of MPI will not show it.

The benchmarks are the osu_latency and osu_bw programs, available from
Ohio State University and included as part of the InfiniPath 1.3
software download (in source and binary form).

The typical symptom of the problem without the workaround (and with
the driver in the 2.6.9-41 kernel) is a timeout during connection
setup.

Construct a hostfile similar to
   hostname-a
   hostname-b
where "hostname-a" is the name returned by hostname.  Assuming
that file is called "mpihosts", run the infinipath 1.3 mpirun
command from one of the two systems (after being sure that you
can ssh to both systems without a password prompt), as follows:
   mpirun -np 2 -m mpihosts -q 75 -i 30 osu_latency
The -q and -i arguments limit the time the command will wait, in
case the initial connection setup succeeds but later packets do not
arrive.

The expected output is similar to this:
mpirun -np 2 -m ~/tmp/x osu_latency
# OSU MPI Latency Test (Version 2.0)
# Size          Latency (us) 
0               2.05
1               2.05
2               2.06
4               2.05
8               2.06
16              2.23
32              2.30
64              2.46
128             2.65
256             2.95
512             3.55
1024            4.64
2048            6.65
4096            9.36
8192            13.76
16384           22.50
32768           40.13
65536           88.01
131072          164.70
262144          317.92
524288          598.73
1048576         1152.32
2097152         2295.16
4194304         4503.84


To see bandwidth rather than latency, replace osu_latency with osu_bw.

With the workaround in place, osu_bw shows results like this:
# OSU MPI Bandwidth Test (Version 2.0)
# Size          Bandwidth (MB/s) 
1               0.651732
2               1.302800
4               2.606929
8               5.217531
16              8.292083
32              14.647343
64              23.436469
128             33.682716
256             43.019580
512             50.037300
1024            54.353774
2048            56.979434
4096            56.992164
8192            57.025875
16384           57.117413
32768           57.185646
65536           921.954681
131072          946.896986
262144          950.289456
524288          952.420350
1048576         953.194099
2097152         953.575631
4194304         953.762674

Whereas it should show results similar to this:
# OSU MPI Bandwidth Test (Version 2.0)
# Size          Bandwidth (MB/s) 
1               2.163581
2               4.325767
4               8.695674
8               17.378681
16              30.907940
32              61.744786
64              122.494611
128             238.692173
256             426.670041
512             695.119125
1024            851.463330
2048            913.777262
4096            929.623021
8192            942.032119
16384           948.161369
32768           951.366090
65536           912.211750
131072          936.202759
262144          944.793563
524288          947.386518
1048576         950.596909
2097152         952.261498
4194304         953.034304

Comment 1 Konrad Rzeszutek 2006-07-14 15:32:46 UTC
<sigh>

It is what I feared:

[root@dhcp7-149 ~]# mpirun -np 2 -m hostfile -q 75 -i 30 osu_latency
MPI_runscript-pesc430-02.rhts.boston.redhat.com.0: ssh -x> Cannot detect
InfiniPath interconnect.
MPI_runscript-pesc430-02.rhts.boston.redhat.com.0: ssh -x> Seek help on loading
InfiniPath interconnect driver.
MPIRUN: Node program(s) exitted during connection setup
[root@dhcp7-149 ~]#


We received four adapters: two HTX and two PCIe. One of the PCIe cards we
sent to Doug Ledford (Raleigh). The IBM machine that had the HTX slot was
sent back for motherboard replacement and has not come back, so the HTX
cards can't be used unless I find some AMD box that has these slots.

What we have working is one PathScale PCIe adapter card and a couple of
Mellanox Technologies MT25208 InfiniHostEx adapters. They all talk nicely
together, but that is not helping us here.

Is there a way I can modify the osu_latency to work with the other adapters?

Comment 2 Konrad Rzeszutek 2006-07-14 15:43:42 UTC
Or would it not even matter to get osu_latency to work with the other adapter,
since that would invalidate what we are trying to reproduce?

Comment 3 Konrad Rzeszutek 2006-07-14 15:48:00 UTC
FYI: We are running now 2.6.9-41.ELsmp kernel.

Comment 4 Dave Olson 2006-07-14 17:35:19 UTC
This sounds to me like "/etc/init.d/infinipath restart" either wasn't done,
or failed in some way.   This code will work with both the HTX and PCIe
adapters.

You can run ipathbug-helper and email it to me or attach it to the bug,
but I'm betting that the driver didn't load for some reason, so the
first set of things to do are:
 
lspci -n | grep -i 1fc1 # check for adapter present and seen on bus

/etc/init.d/infinipath restart # load the driver, etc.

lsmod | grep ipath_

dmesg | grep ipath

grep ipath /proc/interrupts

ipath_control -i

If all goes well, and at standard debugging levels, there won't be
any output from the dmesg command.  We should have an interrupt
handler registered, and > 0 interrupts.  ipath_control -i should
show that the link is up and a LID assigned, something similar to
this:

$Id: PathScale Release1.3 $ $Date: 2006-07-11-14:26 $
 00: Version: Driver 2.0, InfiniPath_HT-460, InfiniPath1 3.2, PCI 2, SW Compat 2
 00: Status: 0xf1 Initted SMA Present IB_link_up IB_configured
 00: LID=0x68 MLID=0xc002 GUID=00:11:75:00:00:06:e0:72 Serial: 1286040114

(That's for our HyperTransport card, without using OpenIB; the presence
of IB_link_up is critical for correct operation.  It may sometimes take
15-60 seconds for the SM and SMA to negotiate the link up.)

Comment 5 Dave Olson 2006-07-14 17:41:56 UTC
Sorry, I guess I didn't completely answer the question about other
adapters.  No, our InfiniPath software won't work on Mellanox cards.
You can run on a mixture of HT and PCIe cards, but that sounds like
it might not help.

I'm assuming the error message you showed was from the system with
a pathscale adapter, right?

MVAPICH over OpenIB would interoperate, but it likely won't show the same
problems, because it uses quite different code paths and setup.

If necessary, we can make systems available to you from our lab
that have RHEL4 UP4 installed, have our adapters installed, and have
serial consoles accessible over the net.  It will take a few hours
to set up so that it's in our externally accessible DMZ, though.

Robert or I can also run any debug kernels, setup, etc. that you 
would like reports back on, if that's faster.

We can also FedEx another PCIe adapter to you, if you give us
the shipping info and if that's faster than getting it back from
Doug.

Comment 6 Konrad Rzeszutek 2006-07-14 17:56:39 UTC
Let me dig around. I want to do a bit of code comparison and see what
differences there are in the kernels from  2.6.9 to 2.6.16 in the affected code
paths. That will take a bit of time.

Could you do a few things, please:

 1). Ship the PCIe adapter to me. I will send you an e-mail with my address.
 2). Try different mainline kernels on your test boxes, starting from 2.6.10 up
to 2.6.15, to see if the fix is in one of those kernels.
 3). If I have a possible fix today or a debug kernel, we can work on getting
the serial console set up externally to track it down further. Let's wait on
this, since I first have to digest the code.

Thanks.

Comment 7 Dave Olson 2006-07-14 18:10:55 UTC
We've not tried all the mainline kernels, and doing so would involve
substantial effort to get the driver to build, based on our backporting
experience.

We know the problem is not in fc3 2.6.12; fc4 2.6.12, 2.6.14, 2.6.15, or
2.6.16; suse 9.3 2.6.11; suse10 2.6.13; or sles10 2.6.16; nor is it in the
kernel.org 2.6.16, 2.6.17, or 2.6.18 kernels.

An earlier version of the driver was also tested on fc3 2.6.11, and that
didn't have the problem either.

Rather than taking the effort to port and validate to a long series
of mainline kernels, I'd prefer to concentrate on the ones that you
think have the highest likelihood of giving a clue as to where the
problem might lie.  Given the list of kernels above, and your knowledge
of VM and/or DMA-related changes, which kernels should we concentrate on
to begin with?

We'll send the PCIe adapter as soon as we get the shipping info.

Comment 8 Konrad Rzeszutek 2006-07-14 21:02:48 UTC
Dave,

Thanks for all that kernel release information. It helps
tremendously. Would it also be possible to attach the dmesg output?

Thanks!

Comment 9 Dave Olson 2006-07-14 22:13:11 UTC
In the "normal" case, 'dmesg|grep ipath_' won't show any output at all,
because everything is OK.  So any output you see potentially, at least,
indicates a problem.  Typical messages of interest would be things like
those in the list below.

     mtrr_add(feb00000,0x100000,WC,0) failed (-22)
     infinipath: probe of 0000:04:01.0 failed with error -22
     Couldn't setup irq handler, irq=%u: %d
     pci_enable_msi failed: %d, interrupts may not work
     Write combining not enabled (err %d): performance may be poor
     Failed to allocate skbuff, length %u
     No interrupts enabled, couldn't setup interrupt address
     Fatal Error (freeze mode), no longer usable

Comment 10 Konrad Rzeszutek 2006-07-14 23:13:09 UTC
Dave,

I was thinking of the full dmesg. I am interested in seeing if SWIOTLB is
enabled on your machine or if IOMMU is disabled.

Comment 11 Dave Olson 2006-07-15 01:23:00 UTC
The IOMMU definitely isn't disabled, and we shouldn't be using the SWIOTLB,
because we have the hardware IOMMU.  Here's the relevant part of the dmesg
output on the 8GB UP4 system:

dmesg | egrep -i iommu\|tlb
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
Total HugeTLB memory allocated, 0

I've attached the full dmesg output, with the driver forcing 32 bit DMA mask.

By the way, we shipped 2 PCIe InfiniPath cards to Jeff Burke's home, so you
should have them this weekend.

Comment 12 Dave Olson 2006-07-15 01:25:37 UTC
Created attachment 132473 [details]
dmesg output from 8GB opteron with UP4 and infinipath

Comment 13 Konrad Rzeszutek 2006-07-15 02:31:08 UTC
Comment #7 mentions that the driver worked under 2.6.11 (fc3). Is that the 1.3
version of the driver, or version 1.1, which has been demonstrated to work
properly?

Comment 14 Konrad Rzeszutek 2006-07-15 02:56:08 UTC
Comment #1 mentions "DMA does not show up in the correct location in memory",
which puzzles me. I have gone over the dma_alloc_coherent code in 2.6.16, 2.6.11
and 2.6.9. When setting up the page with DMA mask set to 64-bit, the driver does
not complain. I ran it with debug options and it happily loaded. No error
messages at all.

Or is it because the driver does not utilize this memory segment until a user
application does something?  If that is the case, are there any tools or code
paths in the InfiniPath driver that I can call to utilize the 128 bytes?  Are
there some magic strings in there so that I can be sure I am getting the right
data?

FYI: I am running a 2.6.9-41 kernel with Jason's latest fix to PROT_READ (which
fixes BZ 197979 I gather).

Comment 15 Dave Olson 2006-07-15 03:09:51 UTC
I believe it was a much older version of the driver, somewhere between
our 1.1 and 1.2 infinipath releases, that was tested with large memory
on fc3 2.6.11 kernels.

With respect to Konrad's comment #14, the driver will not complain.  It's
the user programs that don't get the data where it's supposed to be, so
you have to run an InfiniPath MPI job to see the problem.  That's
the mpirun command as in comment #1, for example.

These are the memory areas allocated with dma_alloc_coherent() that are
then mapped to user addresses, so the user can see the data being
DMA'ed by the InfiniPath chip.
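
For illustration, a minimal sketch of that allocate-then-map-to-user
pattern (hypothetical names and mmap glue, not the actual InfiniPath code):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Allocate a coherent DMA buffer and expose the same pages to a user
 * process through its vma.  Without the IOMMU forced on, the bus address
 * doubles as the physical address, which is why the pfn is derived from
 * it; the pfn arithmetic is exactly where a 32-bit intermediate would
 * truncate addresses above 4GB. */
static int alloc_and_map(struct device *dev, struct vm_area_struct *vma)
{
        dma_addr_t bus;
        void *buf = dma_alloc_coherent(dev, PAGE_SIZE, &bus, GFP_KERNEL);

        if (!buf)
                return -ENOMEM;
        /* A real driver would save buf for its own use of the buffer. */
        return remap_pfn_range(vma, vma->vm_start, bus >> PAGE_SHIFT,
                               PAGE_SIZE, vma->vm_page_prot);
}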

Comment 16 Dave Olson 2006-07-15 13:44:56 UTC
Just in case, I'm attaching a patch to allow the infinipath 1.3 version
of our driver to compile against the RHEL4 UP4 -39 and later kernel.

The infinipath driver in drivers/infiniband/hw/ipath in the kernel sources
for the 2.6.9-40 and later builds can also be used.

Comment 17 Dave Olson 2006-07-15 13:46:47 UTC
Created attachment 132488 [details]
patch so pathscale infinipath 1.3 driver builds on RHEL4 UP 4 2.6.9-39 and later

Comment 18 Konrad Rzeszutek 2006-07-16 02:48:33 UTC
This might be part of the problem. 'pfn' is an unsigned (32-bit) value,
while the DMA address is 64 bits; when the address is bit-shifted, the
result is stored in a 32-bit value (instead of a 64-bit one), so it is
truncated.

--- linux-2.6.9.orig/drivers/infiniband/hw/ipath/ipath_file_ops.c      2006-07-15 22:26:57.000000000 -0400
+++ linux-2.6.9/drivers/infiniband/hw/ipath/ipath_file_ops.c    2006-07-15 22:27:55.000000000 -0400
@@ -921,7 +921,7 @@
                             int write_ok, dma_addr_t addr, char *what)
 {
        struct ipath_devdata *dd = pd->port_dd;
-       unsigned pfn = (unsigned long)addr >> PAGE_SHIFT;
+       unsigned long pfn = (unsigned long)addr >> PAGE_SHIFT;
        int ret;

        if ((vma->vm_end - vma->vm_start) > len) {

I have not yet run the tests, so I cannot say with 100% certainty that this
fixes the problem.

Comment 19 Dave Olson 2006-07-16 16:23:34 UTC
Since the pfn is a page number, this would only be an issue if the 
DMA address was greater than 44 bits.   Since the Opteron (and even
the latest Intel EM64T processors)  have a 40 bit I/O and physical
memory address limit, that seems extremely unlikely.

However, since 2.6.9 doesn't have remap_pfn_range() (which we use
on newer kernels), the pfn gets shifted left by PAGE_SHIFT again by
compatibility code in our driver (just above the function for which
you show the diff) and passed to remap_page_range(), which takes a
long.  If the compiler is truncating at that point, rather than
keeping all 64 bits (which should happen with a function that takes
a long, with the prototype in scope), that could well be the problem.
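
A standalone illustration of that truncation (hypothetical userspace code,
not from the driver). Note that this is ordinary C integer arithmetic
rather than a compiler bug: shifting a 32-bit unsigned value yields a
32-bit result, which is only then converted to long for the function
argument.

#include <stdio.h>

int main(void)
{
        unsigned long addr = 0x17ffff000UL; /* a DMA address above 4GB */
        unsigned pfn32 = addr >> 12;        /* the pfn itself fits in 32 bits */
        unsigned long pfn64 = addr >> 12;

        /* The left shift of the 32-bit pfn happens in 32-bit
         * arithmetic, so the high bits of the address are lost. */
        printf("truncated: %#lx\n", (unsigned long)(pfn32 << 12)); /* 0x7ffff000 */
        printf("correct:   %#lx\n", pfn64 << 12);                  /* 0x17ffff000 */
        return 0;
}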

I made the change and tried it, and in early testing it seems to fix the
problem.  Actually, I made the change to the compatibility macro right
above, even though the other users already have a long, just to defend
against future use.

I also ran with driver debug printing out the physical addresses, to
be sure that addresses handled by this code were in fact above 4GB.
There weren't many; apparently the VM tries really hard to allocate DMA
addresses below 4GB, even when running a memory hog, but there were some.
I'll run a more extensive set of tests.

Either Konrad's or my patch should also be made to the ipath_file_ops.c
that is part of the 2.6.9-41 kernel.  Konrad or Jeff, can one of you
work with Jason to get that to happen, or should one of us at QLogic
work with him on that?

Here's the patch that I did (I made the cast unsigned long rather than
long to be super paranoid about sign extension, although that "can't happen"
on most, if not all, current architectures):

sh-3.00# diff -u ipath_file_ops.c-orig ipath_file_ops.c
--- ipath_file_ops.c-orig       2006-07-16 08:24:57.000000000 -0700
+++ ipath_file_ops.c    2006-07-16 08:40:02.000000000 -0700
@@ -907,10 +907,10 @@

 #ifndef io_remap_pfn_range
 #define io_remap_pfn_range(vma, addr, pfn, size, prot) \
-       io_remap_page_range((vma), (addr), (pfn) << PAGE_SHIFT, (size), \
+       io_remap_page_range((vma), (addr), ((unsigned long)pfn) << PAGE_SHIFT, (size), \
                            (prot))
 #define remap_pfn_range(vma, addr, pfn, size, prot) \
-       remap_page_range((vma), (addr), (pfn) << PAGE_SHIFT, (size), \
+       remap_page_range((vma), (addr), ((unsigned long)pfn) << PAGE_SHIFT, (size), \
                            (prot))
 #endif


This code is used on kernels other than 2.6.9 (as I recall, everything
prior to the 2.6.16 kernel.org kernels), so it would seem that it's
probably an issue only with the gcc that's part of RHEL4, or perhaps
that gcc plus the kernel compile options.

In any case, thanks for tracking this down, Konrad!

I tried this same change on the RHEL4 UP3 2.6.9-34 kernel with the
standard infinipath 1.3 release driver, and it seems to fix the problem
there as well, so we seem to have a change that works on both UP3 and
UP4.  (I verified a >4GB address on UP3 as well, of course.)

Comment 20 Konrad Rzeszutek 2006-07-16 17:24:06 UTC
Created attachment 132526 [details]
Patch to 2.6.9-41 tree

Patch posted to internal reflector for exception + possible inclusion in RHEL4.

Comment 21 Jeff Burke 2006-07-16 19:24:43 UTC
Dave,
    I have tested here with Konrad's patch. It is working as expected. I have
not tested with the patch (QLogic's) from comment #19.

    Konrad has posted his patch to the internal mailing list. This now needs to
go through the internal Red Hat exception process. Kernel developers will
review the patch. PM, QE, Linda W, and Peter M will discuss the exception at
the next meeting, some time on Monday I am sure. Any additional testing data
that you have will only help in potentially getting this patch added.

    Dave O, I just want to verify that you did run with Konrad's patch,
correct? I have attached your patch to the mail Konrad sent out to the mailing
list as well. Developers can choose which approach they wish to ack.

Jeff

Comment 22 Dave Olson 2006-07-16 19:35:52 UTC
Yes, I tested with Konrad's patch, as well as the one I noted.
They should be functionally the same.  I'd be very happy if
Konrad's patch makes it into the final UP4 release kernel.

The more extensive testing is being done with my version of
the patch, since that's what we'll want to use going forward
in our driver (in case we add more calls to remap_pfn_range in
the future).

Those tests, running since my last comment, are all passing.

Thanks again to everybody who has been working on this at
Red Hat.

I'm still somewhat curious as to what exactly is causing the
issue that requires the cast, but that's a distant second to
fixing the problem.

Comment 23 Dave Olson 2006-07-17 14:46:15 UTC
My fairly extensive MPI QA suite tests have been running for nearly 24
hours now, and no problems have been seen so far, so I think Konrad
has definitely got the right answer.

Comment 24 Jason Baron 2006-08-21 20:53:13 UTC
committed in stream U5 build 42.2. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 25 RHEL Program Management 2006-10-12 23:18:27 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 26 Jay Turner 2006-10-17 14:21:55 UTC
QE ack for 4.5.

Comment 29 Red Hat Bugzilla 2007-05-08 02:44:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html

