Red Hat Bugzilla – Bug 782132
openmpiscript - Command mpirun needs parameter --prefix for correct run (on RHEL 6.2)
Last modified: 2013-03-06 13:40:44 EST
Description of problem:
The example file /usr/share/doc/condor-7.6.5/examples/openmpiscript on RHEL 6.2 calls 'mpirun' without an absolute path and without the --prefix parameter. The submitted job remains in the running state, and on one machine the error output contains the following message:

# cat /var/lib/condor/execute/dir_31624/_condor_stderr
summpi: error while loading shared libraries: libmpi_cxx.so.1: cannot open shared object file: No such file or directory
summpi: error while loading shared libraries: libmpi_cxx.so.1: cannot open shared object file: No such file or directory

The variable LD_LIBRARY_PATH is set correctly (it is printed by openmpiscript just before the mpirun call):

LD_LIBRARY_PATH=/usr/lib64/openmpi/lib

When I call mpirun by its full path (/usr/lib64/openmpi/bin/mpirun) or add the --prefix parameter (--prefix /usr/lib64/openmpi) to the command, the job completes correctly and the result appears in the output file.

Version-Release number of selected component (if applicable):
- RHEL 6.2 i386 and x86_64
# rpm -qa | grep condor
condor-7.6.5-0.11.el6.x86_64
condor-classads-7.6.5-0.11.el6.x86_64
# rpm -qa | grep openmpi
openmpi-devel-1.5.3-3.el6.x86_64
openmpi-1.5.3-3.el6.x86_64

How reproducible: 100%

Steps to Reproduce:
1. Run a parallel Open MPI job with the openmpiscript from the examples mentioned above (see Bug 759433, comment 2).
2. Check the error output file in /var/lib/condor/execute/dir_XXXX/_condor_stderr.

Actual results:
...
summpi: error while loading shared libraries: libmpi_cxx.so.1: cannot open shared object file: No such file or directory
...

Expected results:
No error about loading shared libraries.

Additional info:
The deprecated MCA parameter value plm_rsh_agent is covered in Bug 772587.
Executing mpirun via an absolute path name and the --prefix parameter are both described at: http://www.open-mpi.org/doc/v1.5/man1/mpirun.1.php#sect17
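The two workarounds from the description can be sketched as follows. This is only an illustration and not the shipped openmpiscript: the helper names mpirun_with_abs_path and mpirun_with_prefix are hypothetical, and MPDIR is assumed to be the x86_64 install location quoted in the report. The helpers only print the command line they would run, so the sketch can be tried without a Condor pool.

```shell
#!/bin/sh
# Assumed Open MPI install location, taken from the report (RHEL 6.2, x86_64).
MPDIR=/usr/lib64/openmpi

# Hypothetical helper: build the command line that calls mpirun by its
# absolute path, which Open MPI treats like an implicit --prefix.
mpirun_with_abs_path() {
    printf '%s' "$MPDIR/bin/mpirun $*"
}

# Hypothetical helper: build the command line that keeps the bare
# 'mpirun' name but passes the install location explicitly.
mpirun_with_prefix() {
    printf '%s' "mpirun --prefix $MPDIR $*"
}

# Example: print both variants for a two-process run of ./summpi.
mpirun_with_abs_path -np 2 ./summpi; echo
mpirun_with_prefix -np 2 ./summpi; echo
```

Either form tells Open MPI where its installation lives when it starts processes on the remote nodes, which is what the bare 'mpirun' call was missing.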
Could you please attach your summpi example referenced above, together with its Makefile, for independent verification?
Created attachment 568830 [details] summpi.C, Makefile
Mine appears to work fine after the modification made to openmpiscript for Bug 772587. The question is: did you modify the first line of the script as noted in the comments? I will mark this as MODIFIED. Please verify with the aforementioned fix and see whether you can reproduce the problem; if you can, please provide details.
Created attachment 569308 [details] job output files (log, output, error); condor PU configuration; openmpi job files (summpi.C, openmpiscript, openmpi.job, Makefile)

It looks like the problem is still there. Please check my configuration, submit file, and openmpiscript in the attachment in case I have made a mistake there (this is for RHEL 6.2 i386; x86_64 is similar).
(In reply to comment #4)
> It's look like the problem is still there.

To be precise, I realised that the job does finish, unlike what I described in comment 0, and the error message is in the error output file (according to the attachment in comment 4, in /tmp/parallel_job.$(cluster).$(process)-$(NODE).err). (I am not sure now why the jobs remained in the running state when I reported this.)
Made mods upstream to handle rhel6 behavior w/ --prefix.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: Running an Open MPI job using the openmpiscript provided with condor on RHEL 6.2.
C: The job would fail attempting to load shared libraries.
F: Pass the --prefix of the install location of Open MPI to mpirun.
R: Job runs successfully without error.
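One way to obtain "the install location of Open MPI" mentioned in the fix is to derive it from the absolute path of mpirun by dropping the trailing bin/mpirun components. The sketch below is an assumption about how that could be done, not the upstream change itself; prefix_from_mpirun is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: given the absolute path of mpirun, print the
# install prefix by stripping the file name and then the bin directory.
prefix_from_mpirun() {
    bindir=$(dirname "$1")   # e.g. /usr/lib64/openmpi/bin
    dirname "$bindir"        # e.g. /usr/lib64/openmpi
}

# Example with the path quoted in the report.
prefix_from_mpirun /usr/lib64/openmpi/bin/mpirun
```

In a wrapper script the result could then be passed on, for example as mpirun --prefix "$(prefix_from_mpirun "$(command -v mpirun)")", assuming mpirun is found on PATH on the submitting node.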
Tested on RHEL 6.4 - i386/x86_64 with:

# rpm -q condor selinux-policy openmpi
condor-7.8.8-0.3.el6.x86_64
selinux-policy-3.7.19-192.el6.noarch
openmpi-1.5.4-1.el6.x86_64

It works correctly.

>>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html