Bug 555159 - openmpi-1.3.3-6.fc11.x86_64 Allreduce regression.
Summary: openmpi-1.3.3-6.fc11.x86_64 Allreduce regression.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openmpi
Version: 5.4
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Doug Ledford
QA Contact: Red Hat Kernel QE team
URL: https://svn.open-mpi.org/trac/ompi/ti...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-01-13 20:17 UTC by Doug Ledford
Modified: 2010-03-30 08:56 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 538199
Environment:
Last Closed: 2010-03-30 08:56:51 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2010:0292 0 normal SHIPPED_LIVE openib bug fix and enhancement update 2010-03-29 15:16:09 UTC

Description Doug Ledford 2010-01-13 20:17:01 UTC
+++ This bug was initially created as a clone of Bug #538199 +++

Created an attachment (id=369962)
Test case using Allreduce

Description of problem:
Regression using MPI::Allreduce with more than 4 processes.

I'm working on an 8-core workstation, and when I run the attached code with more than 4 processes, it hangs.  It generally hangs around iteration 100, but always before 200.  I have detected no pattern in the number of iterations.

Version-Release number of selected component (if applicable):
openmpi-1.3.3-6.fc11.x86_64 FAILS!
openmpi-1.3.1-1.fc11.x86_64 WORKS!

How reproducible:
Every time.

Steps to Reproduce:
1. Update to openmpi-1.3.3-6.fc11.x86_64
2. make allreduce CXX=mpic++ CXXFLAGS="-O0 -g"
3. mpiexec -n 8 ./allreduce
  
Actual results:
Stops before reaching 99999.

Expected results:
All numbers from 0 to 99999 printed.

Additional info:
I haven't tried other collective functions, so others may fail as well.  I also haven't tried other architectures or running on a true cluster.

--- Additional comment from mmh on 2009-11-19 23:09:03 EST ---

Sounds like what I've been seeing on Fedora 11 and 12 (and RHEL 5.4) with openmpi >= 1.3.2.  But it is not just with allreduce. 

These random hangs are easy to reproduce with the Intel MPI benchmarks.

Corresponding openmpi ticket is probably this one
https://svn.open-mpi.org/trac/ompi/ticket/2043

--- Additional comment from rainy6144 on 2010-01-09 05:04:28 EST ---

The r22324 patch mentioned in the openmpi ticket seems to fix the problem for me.  Looks like a bug in the use of gcc inline assembly.

--- Additional comment from dledford on 2010-01-13 13:25:26 EST ---

I added the r22324 patch to the openmpi-1.4-1 build.

Comment 1 Doug Ledford 2010-01-13 20:17:59 UTC
Cloned from a Fedora bug, but this needs to be fixed in the current rhel5.5 openmpi package in the openib errata.

Comment 3 Gurhan Ozen 2010-03-15 22:14:07 UTC
# mpirun -n 8 ./allreduce_cpp 
<snip>
99949 99950 99951 99952 99953 99954 99955 99956 99957 99958 99959 99960 99961 99962 99963 99964 99965 99966 99967 99968 99969 99970 99971 99972 99973 99974 99975 99976 99977 99978 99979 99980 99981 99982 99983 99984 99985 99986 99987 99988 99989 99990 99991 99992 99993 99994 99995 99996 99997 99998 99999 #

Comment 5 errata-xmlrpc 2010-03-30 08:56:51 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0292.html

