Description of problem:
When running openmpi over an InfiniBand fabric where one of the nodes has more than one HCA, openmpi gives the following warning:

--------------------------------------------------------------------------
WARNING: There are more than one active ports on host
'dell-pe1950-03.rhts.boston.redhat.com', but the default subnet GID
prefix was detected on more than one of these ports. If these ports
are connected to different physical IB networks, this configuration
will fail in Open MPI. This version of Open MPI requires that every
physically separate IB subnet that is used between connected MPI
processes must have different subnet ID values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------

The other node has this:

# ibstat
CA 'ipath0'
    CA type: InfiniPath_QLE7140
    Number of ports: 1
    Firmware version:
    Hardware version: 1
    Node GUID: 0x0011750000687070
    System image GUID: 0x0011750000687070
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 8
        LMC: 0
        SM lid: 1
        Capability mask: 0x02010800
        Port GUID: 0x0011750000687070
CA 'mthca0'
    CA type: MT25204
    Number of ports: 1
    Firmware version: 1.0.700
    Hardware version: a0
    Node GUID: 0x0002c9020020f29c
    System image GUID: 0x0002c9020020f29f
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 6
        LMC: 0
        SM lid: 1
        Capability mask: 0x02510a68
        Port GUID: 0x0002c9020020f29d

Both HCAs are on the same network, with SM lid 1, so the warning seems to be bogus.

Version-Release number of selected component (if applicable):
RHEL5.2 tree.

How reproducible:
Every time.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

This isn't a bug; the warning is correct. The common IB setup doesn't use link aggregation on a single subnet. Multiple links usually go to redundant fabrics, which means that port 1 and port 2 are usually on different networks. If both ports have the same GID prefix but are on two different networks, openmpi will fail to run.

So whenever openmpi detects what looks like the uncommon "dual links on a single subnet" configuration, it prints the warning above, in case the machine really is on two subnets and the admins simply forgot to configure opensm with a different GID prefix on each subnet. (It's impossible for openmpi to know whether the links are *actually* on the same subnet.)

If you check things out and determine that the setup is in fact correct, with both ports on the same subnet, you can set the MCA parameter named in the warning to silence it.
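
As a rough sketch of the check-then-silence workflow described above: the script below prints the subnet prefix of GID index 0 for each port and then disables the warning through Open MPI's environment-variable form of an MCA parameter. The sysfs path is an assumption about the local driver stack and may differ on older kernels, gid_prefix is a hypothetical helper, and ./my_mpi_app is a placeholder. Seeing the default prefix fe80:0000:0000:0000 on more than one port only shows why the warning fires; whether the ports share one physical fabric still has to be confirmed by other means, such as checking the cabling.

```shell
#!/bin/sh
# Hypothetical helper: print the 64-bit subnet prefix, i.e. the first
# four 16-bit groups of a colon-separated port GID.
gid_prefix() {
    printf '%s\n' "$1" | cut -d: -f1-4
}

# Show the subnet prefix of GID index 0 for every port the kernel exposes.
# Assumption: the driver publishes GIDs under /sys/class/infiniband.
for f in /sys/class/infiniband/*/ports/*/gids/0; do
    [ -r "$f" ] || continue   # no IB devices: the glob stays unexpanded
    echo "$f: $(gid_prefix "$(cat "$f")")"
done

# Only after confirming that all ports are on the same fabric:
# silence the warning for every mpirun started from this shell.
export OMPI_MCA_btl_openib_warn_default_gid_prefix=0

# Per-run alternative (./my_mpi_app is a placeholder):
#   mpirun --mca btl_openib_warn_default_gid_prefix 0 -np 2 ./my_mpi_app
```

The environment variable suits a quick test in one shell session; the --mca form keeps the setting visible on the command line of each job.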