Bug 1235044 - OpenMPI programs hang when reading from redirected stdin
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: openmpi
Version: 24
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Michal Schmidt
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-06-23 20:39 UTC by Shane Hart
Modified: 2016-07-09 23:53 UTC

Fixed In Version: openmpi-1.10.3-2.fc24 openmpi-1.8.8-5.fc23.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-05 04:59:00 UTC
Type: Bug
Embargoed:


Attachments
Test Fortran program (374 bytes, text/plain)
2015-06-23 20:39 UTC, Shane Hart

Description Shane Hart 2015-06-23 20:39:23 UTC
Created attachment 1042492
Test Fortran program

Description of problem:

When using the OpenMPI provided with Fedora 22 (1.8.4-5.20150228gitgd83fb30.fc22), Fortran programs (and maybe others) fail to read from a redirected stdin. This appears to be due to another process locking the file descriptor (possibly Torque?).

Version-Release number of selected component (if applicable):

openmpi-devel 1.8.4-5.20150228gitgd83fb30.fc22 

How reproducible:

See the attached simple Fortran program.  Load the default OpenMPI module:

$ module load mpi/openmpi-x86_64

Compile:

$ mpif90 -o test test.f90

Create a file with some text and run test program:

$ mpirun -np 1 ./test < testfile


Actual results:

$ mpirun -np 1 ./test < testfile
[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
 I am the master node!
<HUNG>

Backtrace with GDB:

(gdb) bt
#0  0x00007f3ce996c533 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f3cea277858 in epoll_dispatch () from /lib64/libevent-2.0.so.5
#2  0x00007f3cea262d4a in event_base_loop () from /lib64/libevent-2.0.so.5
#3  0x00007f3ceb714d67 in orterun ()
#4  0x00007f3ceb7135b0 in main ()
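
Note on the warning above: on Linux, epoll_ctl(2) fails with EPERM ("Operation not permitted") when asked to watch a file descriptor that does not support epoll, such as a regular file. With `< testfile`, fd 0 is a regular file, which matches the message. The following is a minimal Python sketch of that kernel behaviour only. It is not code from OpenMPI or libevent, the function names are made up for illustration, and the thread does not establish that this is the whole cause of the hang.

```python
import errno
import os
import select
import tempfile


def epoll_register_result(fd):
    """Try to watch fd for readability with epoll.

    Returns None on success, or the errno name (e.g. "EPERM") on failure.
    """
    ep = select.epoll()  # Linux-only API
    try:
        ep.register(fd, select.EPOLLIN)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        ep.close()


def demo():
    """Compare a regular file (like `< testfile`) with a pipe."""
    results = {}
    with tempfile.TemporaryFile() as regular:
        regular.write(b"This should be printed.\n")
        regular.flush()
        results["regular file"] = epoll_register_result(regular.fileno())
    read_end, write_end = os.pipe()
    try:
        results["pipe"] = epoll_register_result(read_end)
    finally:
        os.close(read_end)
        os.close(write_end)
    return results


if __name__ == "__main__":
    for kind, outcome in demo().items():
        print(f"{kind}: {outcome or 'registered'}")
```

The regular file is rejected while the pipe is accepted, so an event loop that relies on epoll needs some fallback for stdin redirected from a file.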

Expected results:

Running without using mpirun:

$ ./test < testfile
 I am the master node!
 str: This should be printed.


Additional info:

I compiled OpenMPI v1.8.6 myself from the upstream website and repeated the above tests.  Everything works fine.  From some googling around, it seems that something has fd 0 open, and it could be Torque (other people complained about Slurm).  This sounds reasonable because the system OpenMPI is compiled with Torque support and the one I compiled myself isn't.

Comment 1 Orion Poplawski 2015-06-24 03:12:29 UTC
Can you try with https://admin.fedoraproject.org/updates/FEDORA-2015-7476/openmpi-1.8.5-1.fc22 ?

Comment 2 Shane Hart 2015-06-24 13:54:59 UTC
I installed that package from the updates-testing repo, recompiled, and ran the test again.  I got the same problem:

[hts@hts-fedora-pc ~/work/mpi_test]$ mpirun --version
mpirun (Open MPI) 1.8.5

Report bugs to http://www.open-mpi.org/community/help/
[hts@hts-fedora-pc ~/work/mpi_test]$ mpif90 -o test test.f90
[hts@hts-fedora-pc ~/work/mpi_test]$ mpirun -np 1 ./test < testfile
[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
 I am the master node!

Comment 3 Feng Yu 2016-02-24 22:52:35 UTC
Are there any updates on this bug?

I ran into the same issue with the stock version of openmpi in Fedora 23. There is no Torque or Slurm here.

Easiest way to reproduce:

mpirun -n 2 cat <<EOF
> a
> EOF
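
For anyone who wants to check for the hang without leaving a terminal stuck, here is a small Python sketch that runs a command with stdin redirected from a regular file and reports whether it exited within a timeout. The helper name `run_with_file_stdin` and the 10-second default are assumptions for illustration, not part of any OpenMPI tooling.

```python
import subprocess
import tempfile


def run_with_file_stdin(command, text, timeout_s=10.0):
    """Run command with stdin redirected from a regular file.

    Returns (hung, stdout). hung is True if the command did not exit
    within timeout_s seconds; in that case stdout is "".
    """
    with tempfile.TemporaryFile() as stdin_file:
        stdin_file.write(text.encode())
        stdin_file.seek(0)  # rewind so the child reads from the start
        try:
            done = subprocess.run(
                command,
                stdin=stdin_file,
                stdout=subprocess.PIPE,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before raising.
            return True, ""
        return False, done.stdout.decode()


if __name__ == "__main__":
    # Mirrors the reproducer above; requires mpirun on PATH.
    hung, out = run_with_file_stdin(["mpirun", "-n", "2", "cat"], "a\n")
    print("hung" if hung else f"exited, output: {out!r}")
```

On an affected install the expectation is that this reports `hung`; processes launched by mpirun itself may need to be cleaned up separately after the timeout.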

Comment 4 Feng Yu 2016-02-24 22:53:33 UTC
BTW, I get the same error even when I run as root.

Comment 5 Petr Stodulka 2016-03-30 20:46:36 UTC
I get a similar result when I redirect output to a file.

Comment 6 Shane Hart 2016-06-24 12:05:27 UTC
This bug is still present on Fedora 24.

Comment 7 Shane Hart 2016-06-24 15:15:14 UTC
I downloaded the SRPM for openmpi and played around with the .spec file.  Removing the configure option:

--with-libevent=/usr

and therefore using the built-in libevent solves the problem.
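
One way to tell which libevent a given install uses is to look at the shared libraries that mpirun resolves at load time: the backtrace in the description shows /lib64/libevent-2.0.so.5, the system copy. The Python sketch below parses `ldd` output for libevent entries. The helper name `libevent_from_ldd` is hypothetical, and the assumption that a build with the bundled libevent shows no separate libevent entry is a guess, not something confirmed in this report.

```python
import re
import shutil
import subprocess

# Matches ldd lines such as "libevent-2.0.so.5 => /lib64/libevent-2.0.so.5 (0x...)".
_LDD_LINE = re.compile(r"^\s*(libevent\S*)\s+=>\s+(\S+)")


def libevent_from_ldd(ldd_output):
    """Return (soname, path) pairs for libevent libraries in ldd output."""
    found = []
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    mpirun = shutil.which("mpirun")
    if mpirun is None:
        print("mpirun not found on PATH")
    else:
        out = subprocess.run(
            ["ldd", mpirun], stdout=subprocess.PIPE, check=True
        ).stdout.decode()
        print(libevent_from_ldd(out) or "no separate libevent in ldd output")
```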

Comment 8 Michal Schmidt 2016-06-24 15:31:44 UTC
That's an interesting observation.

The bundled libevent 2.0.21 has a few OpenMPI modifications, but it's not obvious if any are related to this bug.
libevent 2.0.22 is available, but not yet in Fedora (bug 1179206).

Comment 9 Fedora Update System 2016-06-27 16:50:40 UTC
openmpi-1.8.8-5.fc23.2 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-6bd09ffe01

Comment 10 Fedora Update System 2016-06-27 16:50:49 UTC
openmpi-1.10.3-2.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-f3a5dce707

Comment 11 Shane Hart 2016-06-27 17:02:27 UTC
I installed the updated package from Koji. Using the bundled libevent fixes this bug.

Comment 12 Fedora Update System 2016-06-28 15:22:25 UTC
openmpi-1.10.3-2.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-f3a5dce707

Comment 13 Fedora Update System 2016-06-28 15:53:42 UTC
openmpi-1.8.8-5.fc23.2 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-6bd09ffe01

Comment 14 Fedora Update System 2016-07-05 04:58:57 UTC
openmpi-1.10.3-2.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2016-07-09 23:53:54 UTC
openmpi-1.8.8-5.fc23.2 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

