Bug 435503 - openmpi gets confused with multiple HCAs
Summary: openmpi gets confused with multiple HCAs
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openmpi
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: Doug Ledford
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-02-29 19:05 UTC by Gurhan Ozen
Modified: 2013-11-04 01:35 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-03-03 18:03:40 UTC
Target Upstream Version:
Embargoed:



Description Gurhan Ozen 2008-02-29 19:05:40 UTC
Description of problem:
 When running openmpi over an InfiniBand fabric, if one of the nodes has more
than one HCA, openmpi gives the following warning:


--------------------------------------------------------------------------
WARNING: There are more than one active ports on host
'dell-pe1950-03.rhts.boston.redhat.com', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------

The other node has this:
# ibstat
CA 'ipath0'
        CA type: InfiniPath_QLE7140
        Number of ports: 1
        Firmware version: 
        Hardware version: 1
        Node GUID: 0x0011750000687070
        System image GUID: 0x0011750000687070
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 8
                LMC: 0
                SM lid: 1
                Capability mask: 0x02010800
                Port GUID: 0x0011750000687070
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.700
        Hardware version: a0
        Node GUID: 0x0002c9020020f29c
        System image GUID: 0x0002c9020020f29f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 6
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a68
                Port GUID: 0x0002c9020020f29d

Both HCAs are on the same network, with SM lid 1, so the warning seems to
be bogus.
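
The warning is about the subnet GID prefix rather than the SM lid, so it can
help to look at the prefix each port reports. A minimal sketch, assuming the
kernel exposes GIDs under /sys/class/infiniband/<device>/ports/<port>/gids/0
(the function name list_gid_prefixes is made up for illustration). The default
prefix appears as fe80:0000:0000:0000.

```shell
# Print the subnet prefix (first 64 bits of GID index 0) for every IB port.
# The sysfs root is an optional argument so the function can be pointed at a
# test directory; by default it reads /sys/class/infiniband.
list_gid_prefixes() {
    root="${1:-/sys/class/infiniband}"
    for gid in "$root"/*/ports/*/gids/0; do
        [ -r "$gid" ] || continue          # skip if the glob matched nothing
        port_dir=${gid%/gids/0}            # <root>/<device>/ports/<port>
        port=${port_dir##*/}
        dev_dir=${port_dir%/ports/*}       # <root>/<device>
        dev=${dev_dir##*/}
        prefix=$(cut -d: -f1-4 "$gid")     # first four 16-bit groups
        printf '%s port %s: %s\n' "$dev" "$port" "$prefix"
    done
}

list_gid_prefixes
```

Two active ports showing the same prefix only means they are configured alike;
it does not prove they are cabled to the same physical fabric.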

Version-Release number of selected component (if applicable):
RHEL5.2 tree.

How reproducible:
Every time.


Comment 1 Doug Ledford 2008-03-03 18:03:40 UTC
This isn't a bug; the warning is correct.  The common IB setup doesn't use link
aggregation on a single subnet.  It usually involves links to redundant fabrics,
which means that port 1 and port 2 are usually on different networks.  If both
ports have the same GID prefix but are on two networks, openmpi will fail to
run.  So whenever openmpi detects what looks like the uncommon "dual links on
single subnet" configuration, it prints the warning above, in case the machine
really is on two subnets and the admins simply forgot to configure opensm with
a different GID prefix on each subnet (it's impossible for openmpi to know
whether the links are *actually* on the same subnet).  If the user checks things
out and determines that the configuration is in fact correct, with both ports on
the same subnet, they can set the MCA parameter named in the warning
(btl_openib_warn_default_gid_prefix to 0) to shut the warning up.
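
For reference, a sketch of how that parameter is typically set, assuming
Open MPI's usual MCA parameter mechanisms. The application name ./my_mpi_app
and the process count are placeholders, not taken from this report.

```shell
# Per run, on the mpirun command line (placeholder application and count):
#   mpirun --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./my_mpi_app

# Per shell session: Open MPI reads MCA parameters from environment
# variables named OMPI_MCA_<parameter>.
export OMPI_MCA_btl_openib_warn_default_gid_prefix=0

# Persistently, the same setting can go in $HOME/.openmpi/mca-params.conf as:
#   btl_openib_warn_default_gid_prefix = 0
```

This only silences the message. It should be used once the ports have been
confirmed to be on the same physical subnet.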

