Bug 1235044

Summary: OpenMPI programs hang when reading from redirected stdin
Product: [Fedora] Fedora
Reporter: Shane Hart <hartsw>
Component: openmpi
Assignee: Michal Schmidt <mschmidt>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 24
CC: dakingun, dledford, orion, pstodulk, rainwoodman, shart6
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openmpi-1.10.3-2.fc24 openmpi-1.8.8-5.fc23.2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-05 04:59:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Test Fortran program none

Description Shane Hart 2015-06-23 20:39:23 UTC
Created attachment 1042492
Test Fortran program

Description of problem:

With the OpenMPI provided with Fedora 22 (1.8.4-5.20150228gitgd83fb30.fc22), Fortran programs (and maybe others) launched through mpirun hang when reading from a redirected stdin. It looks as if another process is locking the file descriptor (possibly Torque?).

Version-Release number of selected component (if applicable):

openmpi-devel 1.8.4-5.20150228gitgd83fb30.fc22 

How reproducible:

The attached simple Fortran program reproduces the problem. Load the default OpenMPI module:

$ module load mpi/openmpi-x86_64

Compile:

$ mpif90 -o test test.f90

Create a file with some text and run the test program:

$ mpirun -np 1 ./test < testfile


Actual results:

$ mpirun -np 1 ./test < testfile
[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
 I am the master node!
<HUNG>

Backtrace with GDB:

(gdb) bt
#0  0x00007f3ce996c533 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f3cea277858 in epoll_dispatch () from /lib64/libevent-2.0.so.5
#2  0x00007f3cea262d4a in event_base_loop () from /lib64/libevent-2.0.so.5
#3  0x00007f3ceb714d67 in orterun ()
#4  0x00007f3ceb7135b0 in main ()
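
Note on the [warn] line: on Linux, epoll_ctl() rejects regular files with EPERM ("Operation not permitted"), and after "< testfile" fd 0 is a regular file. The Python sketch below is not taken from OpenMPI or libevent; it only shows that kernel behaviour in isolation, and the helper name try_epoll_register is made up for illustration. That this rejection is what leaves orterun waiting in epoll_wait() is an assumption consistent with the backtrace, not something confirmed here.

```python
import errno
import os
import select
import tempfile


def try_epoll_register(fd):
    """Return None if epoll accepts fd, otherwise the errno it failed with."""
    ep = select.epoll()
    try:
        ep.register(fd, select.EPOLLIN)  # EPOLL_CTL_ADD under the hood
        return None
    except OSError as exc:
        return exc.errno
    finally:
        ep.close()


if __name__ == "__main__":
    # A regular file, like stdin after "< testfile".
    with tempfile.TemporaryFile() as regular:
        err = try_epoll_register(regular.fileno())
        print("regular file:", errno.errorcode.get(err, "ok"))

    # A pipe, like stdin after "cat testfile | ...".
    read_end, write_end = os.pipe()
    try:
        err = try_epoll_register(read_end)
        print("pipe:", errno.errorcode.get(err, "ok"))
    finally:
        os.close(read_end)
        os.close(write_end)
```

On a Linux system the regular file should be reported as EPERM and the pipe as ok, which matches the errno text in the warning above.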

Expected results:

Running without using mpirun:

$ ./test < testfile
 I am the master node!
 str: This should be printed.


Additional info:

I compiled OpenMPI v1.8.6 myself from the source on their website and repeated the tests above.  Everything works fine.  From some googling around, it seems that something has fd 0 open, and it could be Torque (other people have complained about Slurm).  This sounds reasonable because the system OpenMPI is compiled with Torque support and the one I compiled myself isn't.

Comment 1 Orion Poplawski 2015-06-24 03:12:29 UTC
Can you try with https://admin.fedoraproject.org/updates/FEDORA-2015-7476/openmpi-1.8.5-1.fc22 ?

Comment 2 Shane Hart 2015-06-24 13:54:59 UTC
I installed that package from the updates-testing repo, recompiled, and ran.  Got the same problem:

[hts@hts-fedora-pc ~/work/mpi_test]$ mpirun --version
mpirun (Open MPI) 1.8.5

Report bugs to http://www.open-mpi.org/community/help/
[hts@hts-fedora-pc ~/work/mpi_test]$ mpif90 -o test test.f90
[hts@hts-fedora-pc ~/work/mpi_test]$ mpirun -np 1 ./test < testfile
[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
 I am the master node!

Comment 3 Feng Yu 2016-02-24 22:52:35 UTC
Are there any updates on this bug?

I ran into the same issue with the stock version of openmpi in Fedora 23. No Torque or Slurm here.

Easiest way to reproduce:

mpirun -n 2 cat <<EOF
> a
> EOF
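
To check what kind of object a launcher actually receives as stdin in a case like this, a small helper along the following lines can be run under the same redirection. This is only a sketch: describe_fd and the file name fdkind.py are hypothetical, and whether a here-document shows up as a regular file or as a pipe depends on the shell and its version, which is not known for this report.

```python
import os
import stat


def describe_fd(fd):
    """Classify an open file descriptor by its st_mode file type."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISSOCK(mode):
        return "socket"
    return "other"


if __name__ == "__main__":
    # fd 0 is stdin; run this with the same redirection as the failing command.
    print("stdin is a", describe_fd(0))
```

For example, "python3 fdkind.py < testfile" should report a regular file, while "cat testfile | python3 fdkind.py" should report a pipe.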

Comment 4 Feng Yu 2016-02-24 22:53:33 UTC
BTW, I get the same error even if I run as root.

Comment 5 Petr Stodulka 2016-03-30 20:46:36 UTC
I get a similar result when I redirect output to a file.

Comment 6 Shane Hart 2016-06-24 12:05:27 UTC
This bug is still present on Fedora 24.

Comment 7 Shane Hart 2016-06-24 15:15:14 UTC
I downloaded the SRPM for openmpi and played around with the .spec file.  Removing the configure option:

--with-libevent=/usr

and therefore using the built-in libevent solves the problem.

Comment 8 Michal Schmidt 2016-06-24 15:31:44 UTC
That's an interesting observation.

The bundled libevent 2.0.21 has a few OpenMPI modifications, but it's not obvious if any are related to this bug.
libevent 2.0.22 is available, but not yet in Fedora (bug 1179206).

Comment 9 Fedora Update System 2016-06-27 16:50:40 UTC
openmpi-1.8.8-5.fc23.2 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-6bd09ffe01

Comment 10 Fedora Update System 2016-06-27 16:50:49 UTC
openmpi-1.10.3-2.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-f3a5dce707

Comment 11 Shane Hart 2016-06-27 17:02:27 UTC
I installed the updated package from Koji, and using the bundled libevent fixes this bug.

Comment 12 Fedora Update System 2016-06-28 15:22:25 UTC
openmpi-1.10.3-2.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-f3a5dce707

Comment 13 Fedora Update System 2016-06-28 15:53:42 UTC
openmpi-1.8.8-5.fc23.2 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-6bd09ffe01

Comment 14 Fedora Update System 2016-07-05 04:58:57 UTC
openmpi-1.10.3-2.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2016-07-09 23:53:54 UTC
openmpi-1.8.8-5.fc23.2 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.