Bug 986409

Summary: Failure in MPI_Init
Product: [Fedora] Fedora
Reporter: Kevin Hobbs <kevin.hobbs.1>
Component: hwloc
Assignee: Jiri Hladky <hladky.jiri>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 18
CC: dakingun, dledford, fenlason, hladky.jiri, jhladky, orion
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2014-01-09 21:43:47 UTC
Type: Bug
Attachments:
  A very simple MPI program (flags: none)
  The error message (flags: none)

Description Kevin Hobbs 2013-07-19 17:10:37 UTC
Created attachment 775876
A very simple MPI program

Description of problem:

On my home computer, a Lenovo S10, even the simplest MPI programs fail during MPI_Init.

I noticed the problem on the VTK and ParaView dashboards after upgrading from Fedora 17 to 18.

The simple program is attached.

It compiles fine with :

  mpicc -g -o mpi_simple mpi_simple.c

The compiled program runs on my work machine.

My home machine can run non-MPI programs with mpirun:

  mpirun -n 4 hostname
  murron.hobbs-hancock
  murron.hobbs-hancock
  murron.hobbs-hancock
  murron.hobbs-hancock

but when I run the simple MPI program with :

  mpirun -n 1 mpi_simple

I get an error that includes :

  orte_util_nidmap_init failed

I'll attach the full output.

I've tried running the program in gdb with :

  mpirun -n 1 gdb ./mpi_simple

and setting a breakpoint at the offending call, and stepping from there, but I failed to learn anything except that there's a lot of unpacking in MPI setup.

Version-Release number of selected component (if applicable):

openmpi-1.6.3-7.fc18.x86_64

How reproducible:

Always on one host.

Steps to Reproduce:
1. mpicc -g -o mpi_simple mpi_simple.c
2. mpirun -n 1 mpi_simple

Actual results:

Attached error message.

Expected results:
my rank is 0 of 1
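
The attachment itself is not reproduced in this report. As a rough stand-in, the sketch below assumes the attached program does little more than initialize MPI and print its rank and the communicator size. The format string is inferred from the expected output above, and the helper name write_mpi_simple is hypothetical. The real attachment may differ.

```shell
#!/bin/sh
# Hypothetical helper: writes an ASSUMED stand-in for the attached
# mpi_simple.c to the path given as $1. Not the actual attachment.
write_mpi_simple() {
    cat > "$1" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);   /* the call reported as failing */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("my rank is %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
}
```

With the openmpi module loaded, the steps to reproduce would then be "write_mpi_simple mpi_simple.c" followed by the mpicc and mpirun commands listed above.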

Additional info:

Comment 1 Kevin Hobbs 2013-07-19 17:12:03 UTC
Created attachment 775877
The error message

Comment 2 Orion Poplawski 2013-07-19 17:37:42 UTC
I can't reproduce.  You might try asking for help on the openmpi list.

You loaded the openmpi module right?  What does "module list" show?  What does "ldd mpi_simple" show?

Comment 3 Kevin Hobbs 2013-07-19 19:06:49 UTC
(In reply to Orion Poplawski from comment #2)
> I can't reproduce.

Since the trouble only occurs on 1 of my hosts I'm not surprised.

>  You might try asking for help on the openmpi list.

I'm slowly building up the nerve to join yet another dev-list.

> 
> You loaded the openmpi module right?

Yup.

>  What does "module list" show?

Currently Loaded Modulefiles:
  1) mpi/openmpi-x86_64

>  What does "ldd mpi_simple" show?

        linux-vdso.so.1 =>  (0x00007fff14bb1000)
        libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x0000003c55200000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003c53e00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003c53200000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003c53a00000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003c54200000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003c6c200000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003c6de00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003c53600000)
        libhwloc.so.5 => /lib64/libhwloc.so.5 (0x0000003c57600000)
        libltdl.so.7 => /lib64/libltdl.so.7 (0x0000003c77000000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003c54a00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003c52e00000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x0000003c57200000)
        libpci.so.3 => /lib64/libpci.so.3 (0x0000003c55e00000)
        libxml2.so.2 => /lib64/libxml2.so.2 (0x0000003c5d600000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003c55a00000)
        libz.so.1 => /lib64/libz.so.1 (0x0000003c54600000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x0000003c59600000)
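
Of the libraries listed, the two that matter for the rest of this thread are libhwloc and libxml2. A small filter, sketched here with a hypothetical function name, narrows ldd-style output to just those lines:

```shell
# Hypothetical helper: read ldd-style output on stdin and keep only the
# lines that mention the hwloc or libxml2 shared libraries.
hwloc_xml_deps() {
    grep -E 'libhwloc\.so|libxml2\.so'
}
```

It would be used as "ldd mpi_simple | hwloc_xml_deps". Seeing both libraries resolve under /lib64 only shows that the system hwloc is in use and that libxml2 is loaded with it. It does not by itself show which one is at fault.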

Comment 4 Orion Poplawski 2013-07-19 19:11:05 UTC
That all looks good, so no idea what is up.  Note that there is an openmpi user list and a dev list.  You'll want the user list.

Comment 5 Kevin Hobbs 2013-07-21 12:59:46 UTC
I began this thread on the openmpi users list :

http://www.open-mpi.org/community/lists/users/2013/07/22346.php

The problem seems to involve the version of hwloc available in Fedora versus the version bundled with openmpi.

Comment 6 Kevin Hobbs 2013-07-23 20:49:05 UTC
This bug may actually be a hwloc bug.

After a long thread on the OMPI users list I've learned :

1. openmpi and versions of hwloc close to those available on Fedora 18 and embedded in openmpi do not work well together on my machine unless --disable-libxml2 is used to configure hwloc.

2. I can use the Fedora versions of openmpi and hwloc if I set HWLOC_NO_LIBXML_IMPORT=1 before mpirun.
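
The workaround in item 2 can be applied per command rather than exported for the whole session. The wrapper below is a minimal sketch with a hypothetical function name. As far as I can tell from the hwloc documentation, the variable makes hwloc fall back to its built-in minimal XML parser for imports even when libxml2 support is compiled in.

```shell
# Hypothetical wrapper: run any command with HWLOC_NO_LIBXML_IMPORT=1 set
# for that command only, leaving the caller's environment untouched.
run_no_libxml_import() {
    env HWLOC_NO_LIBXML_IMPORT=1 "$@"
}
```

For example, "run_no_libxml_import mpirun -n 1 ./mpi_simple". Whether mpirun passes the variable on to ranks started on other hosts is not covered here. The failing case in this report is a single local rank.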

I will now check openmpi with other versions of hwloc (those in 1. are ancient).

Comment 7 Kevin Hobbs 2013-07-24 11:58:38 UTC
openmpi-1.6.5 configured with :

  ./configure \
    --prefix=/opt/openmpi-1.6.5_hwloc-1.7.1 \
    --with-hwloc=/opt/hwloc-1.7.1

where hwloc-1.7.1 was configured with :

  ./configure --prefix=/opt/hwloc-1.7.1

*********
* works *
*********

openmpi-1.6.5 configured with :

  ./configure \
    --prefix=/opt/openmpi-1.6.5_hwloc-1.6.2 \
    --with-hwloc=/opt/hwloc-1.6.2

where hwloc-1.6.2 was configured with :

  ./configure --prefix=/opt/hwloc-1.6.2

*********
* works *
*********

openmpi-1.6.5 configured with :

  ./configure \
    --prefix=/opt/openmpi-1.6.5_hwloc-1.5.2 \
    --with-hwloc=/opt/hwloc-1.5.2

where hwloc-1.5.2 was configured with :

  ./configure --prefix=/opt/hwloc-1.5.2

*********
* works *
*********

openmpi-1.6.5 configured with :

  ./configure \
    --prefix=/opt/openmpi-1.6.5_hwloc-1.4.3 \
    --with-hwloc=/opt/hwloc-1.4.3

where hwloc-1.4.3 was configured with :

  ./configure --prefix=/opt/hwloc-1.4.3

*********
* fails *
*********

openmpi-1.6.5 configured with :

  ./configure \
    --prefix=/opt/openmpi-1.6.5_hwloc-1.4.3_noxml2 \
    --with-hwloc=/opt/hwloc-1.4.3_noxml2

where hwloc-1.4.3 was configured with :

  ./configure \
    --prefix=/opt/hwloc-1.4.3_noxml2 \
    --disable-libxml2

*********
* works *
*********
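
The five builds above follow one pattern, so they can be generated rather than typed. The sketch below only prints the configure commands and does not download, build, or install anything. The function names are hypothetical, and the prefixes copy the /opt layout used above.

```shell
# Hypothetical helpers: print (do not run) the configure commands from the
# matrix above. $1 is the hwloc version. $2 is an optional extra hwloc
# configure flag; --disable-libxml2 also selects the _noxml2 prefix suffix.
hwloc_prefix() {
    if [ "${2:-}" = "--disable-libxml2" ]; then
        printf '/opt/hwloc-%s_noxml2\n' "$1"
    else
        printf '/opt/hwloc-%s\n' "$1"
    fi
}

hwloc_configure_cmd() {
    printf './configure --prefix=%s%s\n' "$(hwloc_prefix "$@")" "${2:+ $2}"
}

ompi_configure_cmd() {
    hw=$(hwloc_prefix "$@")
    printf './configure --prefix=/opt/openmpi-1.6.5_%s --with-hwloc=%s\n' \
        "${hw#/opt/}" "$hw"
}

# Print the whole matrix: four default builds plus the libxml2-free 1.4.3.
for v in 1.7.1 1.6.2 1.5.2 1.4.3; do
    hwloc_configure_cmd "$v"
    ompi_configure_cmd "$v"
done
hwloc_configure_cmd 1.4.3 --disable-libxml2
ompi_configure_cmd 1.4.3 --disable-libxml2
```

Each hwloc line is meant to be run in the matching hwloc source tree and each openmpi line in the openmpi-1.6.5 tree, followed by the usual make and make install.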

Comment 8 Orion Poplawski 2013-07-24 17:20:32 UTC
Kevin - Thank you very much for the detailed analysis.  Unfortunately this leaves us in a bit of a pickle.  There is an ABI break from 1.4.3 to 1.5, despite no soname bump:

http://upstream-tracker.org/versions/hwloc.html

So I don't think we can just update hwloc in F18.  I'm re-assigning to hwloc for now to get some more input.  I don't know if this was a known issue that was addressed in hwloc and could perhaps get back-ported to 1.4.X.

Comment 9 Jiri Hladky 2013-07-29 10:09:04 UTC
Hi Kevin,

it works fine on my F18 notebook.

In any case, hwloc 1.4.3 is the latest build from the 1.4 stream, so unfortunately I don't see any way to fix this in F18.

Please try to use Fedora 19 instead. In F19 hwloc version 1.7 is used. It should solve all the problems.

Thanks a lot!
Jirka

Comment 10 Fedora End Of Life 2013-12-21 15:52:02 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.