Bug 986409
Summary: | Failure in MPI_Init | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Kevin Hobbs <kevin.hobbs.1>
Component: | hwloc | Assignee: | Jiri Hladky <hladky.jiri>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 18 | CC: | dakingun, dledford, fenlason, hladky.jiri, jhladky, orion
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-01-09 21:43:47 UTC | Type: | Bug
Attachments: |
|
Created attachment 775877 [details]
The error message
I can't reproduce. You might try asking for help on the openmpi list.

You loaded the openmpi module, right? What does "module list" show? What does "ldd mpi_simple" show?

(In reply to Orion Poplawski from comment #2)
> I can't reproduce.

Since the trouble only occurs on 1 of my hosts I'm not surprised.

> You might try asking for help on the openmpi list.

I'm slowly building up the nerve to join yet another dev-list.

> You loaded the openmpi module right?

Yup.

> What does "module list" show?

Currently Loaded Modulefiles:
  1) mpi/openmpi-x86_64

> What does "ldd mpi_simple" show?

    linux-vdso.so.1 => (0x00007fff14bb1000)
    libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x0000003c55200000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003c53e00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003c53200000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003c53a00000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003c54200000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003c6c200000)
    libutil.so.1 => /lib64/libutil.so.1 (0x0000003c6de00000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003c53600000)
    libhwloc.so.5 => /lib64/libhwloc.so.5 (0x0000003c57600000)
    libltdl.so.7 => /lib64/libltdl.so.7 (0x0000003c77000000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003c54a00000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003c52e00000)
    libnuma.so.1 => /lib64/libnuma.so.1 (0x0000003c57200000)
    libpci.so.3 => /lib64/libpci.so.3 (0x0000003c55e00000)
    libxml2.so.2 => /lib64/libxml2.so.2 (0x0000003c5d600000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003c55a00000)
    libz.so.1 => /lib64/libz.so.1 (0x0000003c54600000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x0000003c59600000)

That all looks good, so no idea what is up. Note that there is an openmpi user list and a dev list. You'll want the user list.

I began this thread on the openmpi users list:
http://www.open-mpi.org/community/lists/users/2013/07/22346.php

The problem seems to involve the version of hwloc available in Fedora versus the version bundled with openmpi.
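Since the thread ends up focusing on hwloc, it can help to pull just the relevant lines out of an ldd listing like the one above. The following is a minimal sketch: the function name hwloc_deps is made up for illustration, and it only filters text, so it shows which libhwloc and libxml2 files get resolved, not which one is at fault.

```shell
# Read `ldd` output on stdin and keep only the lines that name
# libhwloc or libxml2. hwloc_deps is a hypothetical helper name.
hwloc_deps() {
    grep -E 'libhwloc|libxml2'
}
```

Typical use would be `ldd ./mpi_simple | hwloc_deps`. Running `rpm -q hwloc openmpi` alongside it shows the packaged versions those libraries come from.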
This bug may actually be a hwloc bug. After a long thread on the OMPI users list I've learned:

1. openmpi and versions of hwloc close to those available on Fedora 18 and embedded in openmpi do not work well together on my machine unless --disable-libxml2 is used to configure hwloc.

2. I can use the fedora versions of openmpi and hwloc if I set HWLOC_NO_LIBXML_IMPORT=1 before mpirun.

I will now check openmpi with other versions of hwloc (those in 1. are ancient).

openmpi-1.6.5 configured with:

    ./configure \
      --prefix=/opt/openmpi-1.6.5_hwloc-1.7.1 \
      --with-hwloc=/opt/hwloc-1.7.1

where hwloc-1.7.1 was configured with:

    ./configure --prefix=/opt/hwloc-1.7.1

*********
* works *
*********

openmpi-1.6.5 configured with:

    ./configure \
      --prefix=/opt/openmpi-1.6.5_hwloc-1.6.2 \
      --with-hwloc=/opt/hwloc-1.6.2

where hwloc-1.6.2 was configured with:

    ./configure --prefix=/opt/hwloc-1.6.2

*********
* works *
*********

openmpi-1.6.5 configured with:

    ./configure \
      --prefix=/opt/openmpi-1.6.5_hwloc-1.5.2 \
      --with-hwloc=/opt/hwloc-1.5.2

where hwloc-1.5.2 was configured with:

    ./configure --prefix=/opt/hwloc-1.5.2

*********
* works *
*********

openmpi-1.6.5 configured with:

    ./configure \
      --prefix=/opt/openmpi-1.6.5_hwloc-1.4.3 \
      --with-hwloc=/opt/hwloc-1.4.3

where hwloc-1.4.3 was configured with:

    ./configure --prefix=/opt/hwloc-1.4.3

*********
* fails *
*********

openmpi-1.6.5 configured with:

    ./configure \
      --prefix=/opt/openmpi-1.6.5_hwloc-1.4.3_noxml2 \
      --with-hwloc=/opt/hwloc-1.4.3_noxml2

where hwloc-1.4.3 was configured with:

    ./configure \
      --prefix=/opt/hwloc-1.4.3_noxml2 \
      --disable-libxml2

*********
* works *
*********

Kevin - Thank you very much for the detailed analysis. Unfortunately this leaves us in a bit of a pickle. There is an ABI break from 1.4.3 to 1.5, despite no soname bump:

http://upstream-tracker.org/versions/hwloc.html

So I don't think we can just update hwloc in F18. I'm re-assigning to hwloc for now to get some more input.
I don't know if this was a known issue that was addressed in hwloc and could perhaps get back-ported to 1.4.X.

Hi Kevin,

it works fine on my F18 notebook. In any case, hwloc 1.4.3 is the latest build from the 1.4 stream, so unfortunately I don't see any way to fix this in F18. Please try Fedora 19 instead. F19 ships hwloc version 1.7, which should solve all the problems.

Thanks a lot!
Jirka

This message is a reminder that Fedora 18 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 18. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 18 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
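For anyone who has to stay on F18, the workaround reported earlier in this bug (setting HWLOC_NO_LIBXML_IMPORT=1 before mpirun) could be wrapped in a small helper. This is only a sketch: the variable name and value come from the report, while the function name run_without_hwloc_xml is hypothetical.

```shell
# Run any command with HWLOC_NO_LIBXML_IMPORT=1 in its environment.
# The variable and value are the ones reported to work in this bug;
# the wrapper name is made up for this example.
run_without_hwloc_xml() {
    env HWLOC_NO_LIBXML_IMPORT=1 "$@"
}
```

For example, `run_without_hwloc_xml mpirun -n 1 ./mpi_simple` sets the variable for that one invocation only and leaves the rest of the shell session untouched.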
Created attachment 775876 [details]
A very simple MPI program

Description of problem:

On my home computer, which is a Lenovo S10, even the simplest MPI programs fail during MPI_Init. I noticed the problem on the VTK and ParaView dashboards after upgrading from Fedora 17 to 18.

The simple program is attached. It compiles fine with:

    mpicc -g -o mpi_simple mpi_simple.c

The compiled program runs on my work machine.

My home machine can run non-MPI programs with mpirun:

    mpirun -n 4 hostname
    murron.hobbs-hancock
    murron.hobbs-hancock
    murron.hobbs-hancock
    murron.hobbs-hancock

but when I run the simple MPI program with:

    mpirun -n 1 mpi_simple

I get an error that includes:

    orte_util_nidmap_init failed

I'll attach the full output.

I've tried running the program in gdb with:

    mpirun -n 1 gdb ./mpi_simple

setting a breakpoint at the offending call and stepping from there, but I failed to learn anything except that there's a lot of unpacking in MPI setup.

Version-Release number of selected component (if applicable):
openmpi-1.6.3-7.fc18.x86_64

How reproducible:
Always on one host.

Steps to Reproduce:
1. mpicc -g -o mpi_simple mpi_simple.c
2. mpirun -n 1 mpi_simple

Actual results:
Attached error message.

Expected results:
my rank is 0 of 1

Additional info:
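The attached mpi_simple.c is not reproduced on this page. A rough stand-in, assuming only what the report states (the failure is in MPI_Init and the expected output is "my rank is 0 of 1"), might look like the sketch below. The C source and both helper names are guesses for illustration, not the reporter's file.

```shell
# Write a guessed stand-in for the attached test program.
# The C source below is an assumption, not the original attachment.
write_mpi_simple() {
    cat > "${1:-mpi_simple.c}" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* the call that fails in this report */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("my rank is %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
}

# Follow the report's two steps: compile with mpicc, run one rank.
reproduce() {
    write_mpi_simple mpi_simple.c &&
    mpicc -g -o mpi_simple mpi_simple.c &&
    mpirun -n 1 ./mpi_simple
}
```

Calling `reproduce` needs the openmpi module loaded so that mpicc and mpirun are on the PATH.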