Bug 435503

Summary: openmpi gets confused with multiple HCAs
Product: Red Hat Enterprise Linux 5
Component: openmpi
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: low
Priority: low
Target Milestone: rc
Reporter: Gurhan Ozen <gozen>
Assignee: Doug Ledford <dledford>
CC: jburke
Doc Type: Bug Fix
Last Closed: 2008-03-03 18:03:40 UTC

Description Gurhan Ozen 2008-02-29 19:05:40 UTC
Description of problem:
When running openmpi over an InfiniBand fabric where one of the nodes has more
than one HCA, openmpi gives the following warning:


--------------------------------------------------------------------------
WARNING: There are more than one active ports on host
'dell-pe1950-03.rhts.boston.redhat.com', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------

The other node has this:
# ibstat
CA 'ipath0'
        CA type: InfiniPath_QLE7140
        Number of ports: 1
        Firmware version: 
        Hardware version: 1
        Node GUID: 0x0011750000687070
        System image GUID: 0x0011750000687070
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 8
                LMC: 0
                SM lid: 1
                Capability mask: 0x02010800
                Port GUID: 0x0011750000687070
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.700
        Hardware version: a0
        Node GUID: 0x0002c9020020f29c
        System image GUID: 0x0002c9020020f29f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 6
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a68
                Port GUID: 0x0002c9020020f29d

Both HCAs are on the same network, with SM lid 1, so the warning seems to be
bogus.
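
The check described above (which ports are active, and which SM lid each one
reports) can be sketched in a few lines of Python. This is only an
illustration: parse_ibstat is a hypothetical helper, not part of any
InfiniBand tool, and SAMPLE is a trimmed copy of the ibstat output quoted
above. A matching SM lid is suggestive rather than conclusive, since LIDs are
assigned per subnet and two separate fabrics could each have a subnet manager
at lid 1.

```python
import re


def parse_ibstat(text):
    """Return one dict per port found in `ibstat` text output (hypothetical helper)."""
    ports = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            # New adapter section; port fields have not started yet.
            ca = m.group(1)
            port = None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = {"ca": ca, "port": int(m.group(1))}
            ports.append(port)
            continue
        if port is not None and ":" in line:
            # "Key: value" line belonging to the current port.
            key, _, value = line.partition(":")
            port[key.strip()] = value.strip()
    return ports


# Trimmed copy of the output shown in this report.
SAMPLE = """\
CA 'ipath0'
        Number of ports: 1
        Port 1:
                State: Active
                Base lid: 8
                SM lid: 1
CA 'mthca0'
        Number of ports: 1
        Port 1:
                State: Active
                Base lid: 6
                SM lid: 1
"""

active = [p for p in parse_ibstat(SAMPLE) if p.get("State") == "Active"]
for p in active:
    print(p["ca"], p["port"], "SM lid", p["SM lid"])
```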

Version-Release number of selected component (if applicable):
RHEL5.2 tree.

How reproducible:
Every time.


Comment 1 Doug Ledford 2008-03-03 18:03:40 UTC
This isn't a bug; the warning is correct.  The common IB setup doesn't use
link aggregation on a single subnet.  It usually involves links to redundant
fabrics, which means that port 1 and port 2 are usually on different networks.
If both ports have the same GID prefix but are on two networks, openmpi will
fail to run.  So whenever openmpi detects what looks like the uncommon "dual
links on a single subnet" configuration, it prints the warning above, in case
the machine really is on two subnets and the admins simply forgot to configure
opensm with a different GID prefix for each subnet (it's impossible for
openmpi to know whether the links are *actually* on the same subnet).  If you
check the setup and determine that it is in fact correct and both ports are on
the same subnet, you can set the option listed in the warning
(btl_openib_warn_default_gid_prefix set to 0) to silence it.
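
The reasoning in this comment can be modelled roughly as follows. This is a
sketch of the heuristic as described here, not Open MPI's actual source: the
should_warn function and the (name, state, gid_prefix) tuple shape are
assumptions made for illustration. The only fixed value is the InfiniBand
default subnet prefix, 0xfe80000000000000, and the warn_enabled flag stands in
for the btl_openib_warn_default_gid_prefix parameter.

```python
# InfiniBand default subnet (GID) prefix, used when opensm is not
# configured with a site-specific one.
DEFAULT_GID_PREFIX = 0xFE80000000000000


def should_warn(ports, warn_enabled=True):
    """Decide whether the multi-port warning applies (illustrative model).

    ports: iterable of (name, state, gid_prefix) tuples, a hypothetical shape.
    warn_enabled: stands in for btl_openib_warn_default_gid_prefix != 0.
    """
    default_active = [
        name
        for name, state, prefix in ports
        if state == "Active" and prefix == DEFAULT_GID_PREFIX
    ]
    # More than one active port on the default prefix: the ports may be on
    # one subnet or on two misconfigured ones, and that cannot be told apart.
    return warn_enabled and len(default_active) > 1


both_default = [
    ("ipath0:1", "Active", DEFAULT_GID_PREFIX),
    ("mthca0:1", "Active", DEFAULT_GID_PREFIX),
]
distinct = [
    ("ipath0:1", "Active", DEFAULT_GID_PREFIX),
    ("mthca0:1", "Active", 0xFE80000000000001),
]

print(should_warn(both_default))
print(should_warn(both_default, warn_enabled=False))
print(should_warn(distinct))
```

On a real run, the parameter is normally passed on the command line as
"mpirun --mca btl_openib_warn_default_gid_prefix 0 ...", once the fabric
layout has been confirmed.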