+++ This bug was initially created as a clone of Bug #538199 +++

Created an attachment (id=369962)
Test case using Allreduce

Description of problem:
Regression using MPI::Allreduce with more than 4 processes. I'm working on an 8-core workstation, and when I run the attached code with more than 4 processes, it hangs, generally around iteration 100 and always before iteration 200. I have detected no pattern in the iteration at which it hangs.

Version-Release number of selected component (if applicable):
openmpi-1.3.3-6.fc11.x86_64 FAILS!
openmpi-1.3.1-1.fc11.x86_64 WORKS!

How reproducible:
Every time.

Steps to Reproduce:
1. Update to openmpi-1.3.3-6.fc11.x86_64
2. make allreduce CXX=mpic++ CXXFLAGS="-O0 -g"
3. mpiexec -n 8 ./allreduce

Actual results:
Stops before reaching 99999.

Expected results:
All numbers from 0 to 99999 printed.

Additional info:
I haven't tried other collective functions, so others may fail as well. I also haven't tried other architectures or running on a true cluster.

--- Additional comment from mmh on 2009-11-19 23:09:03 EST ---

Sounds like what I've been seeing on Fedora 11 and 12 (and RHEL 5.4) with openmpi >= 1.3.2. But it is not just with allreduce. These random hangs are easy to reproduce with the Intel MPI benchmarks. The corresponding openmpi ticket is probably this one:
https://svn.open-mpi.org/trac/ompi/ticket/2043

--- Additional comment from rainy6144 on 2010-01-09 05:04:28 EST ---

The r22324 patch mentioned in the openmpi ticket seems to fix the problem for me. Looks like a bug in the use of gcc inline assembly.

--- Additional comment from dledford on 2010-01-13 13:25:26 EST ---

I added the r22324 patch to the openmpi-1.4-1 build.
Cloned from a Fedora bug, but this needs to be fixed in the current RHEL 5.5 openmpi package in the openib errata.
# mpirun -n 8 ./allreduce_cpp <snip> 99949 99950 99951 99952 99953 99954 99955 99956 99957 99958 99959 99960 99961 99962 99963 99964 99965 99966 99967 99968 99969 99970 99971 99972 99973 99974 99975 99976 99977 99978 99979 99980 99981 99982 99983 99984 99985 99986 99987 99988 99989 99990 99991 99992 99993 99994 99995 99996 99997 99998 99999 #
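Since the failure mode is a hang rather than a crash, a check like the run above needs a timeout to be automated. The following is only a minimal sketch, not part of the original test case: it assumes the test program prints each iteration number to stdout separated by whitespace (as in the output above), and the helper names `last_iteration` and `run_check` and the 600-second default timeout are hypothetical choices for illustration.

```python
import subprocess


def last_iteration(output):
    """Return the highest iteration number found in the output, or -1 if none."""
    numbers = [int(tok) for tok in output.split() if tok.isdigit()]
    return max(numbers, default=-1)


def run_check(cmd, expected_last=99999, timeout_s=600):
    """Run cmd and return (ok, last).

    ok is True only if the command exited with status 0 before the timeout
    and the highest number it printed equals expected_last. The timeout
    value is an assumption; tune it to the machine under test.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # Treat a timeout as a hang; report how far the run got, if known.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return False, last_iteration(partial)
    last = last_iteration(proc.stdout)
    return proc.returncode == 0 and last == expected_last, last


# Example call (requires an MPI installation and the compiled test case):
# ok, last = run_check(["mpirun", "-n", "8", "./allreduce_cpp"])
```

On a timeout, `subprocess.run` kills only the launcher process it started, so stray MPI ranks from a hung run may still need to be cleaned up by hand.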
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0292.html