Bug 759433

Summary: OpenMPI job fails when sshd.sh puts identity keys back.
Product: Red Hat Enterprise MRG
Reporter: Daniel Horák <dahorak>
Component: condor
Assignee: Timothy St. Clair <tstclair>
Status: CLOSED ERRATA
QA Contact: Daniel Horák <dahorak>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 2.1
CC: ltoscano, matt, mkudlej, tstclair
Target Milestone: 2.1.1   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: condor-7.6.5-0.9
Doc Type: Bug Fix
Doc Text:
C: Run an OpenMPI/parallel universe job
C: condor_chirp will fail to write file
F: condor_chirp was using relative paths vs. absolute
R: Parallel universe jobs run to completion
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-02-06 18:18:29 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 765607    
Attachments:
Description Flags
Configuration and OpenMPI job (to comment 0) none

Description Daniel Horák 2011-12-02 11:33:31 UTC
Description of problem:
  An OpenMPI job submitted to the parallel universe fails when condor_chirp puts the identity keys back.

Version-Release number of selected component (if applicable):
  condor-7.6.5-0.8.el5.i386

How reproducible:
  100%

Steps to Reproduce:
1. Set up the parallel universe (see the configuration file in the attachment).
2. Submit the OpenMPI job included in the attachment
    (it is the same as in bug 537232 comment 2).
  - openmpiscript is customised from the current version of
      /usr/share/doc/condor-7.6.5/examples/openmpiscript
3. After the job finishes, check the job's output and error files.
  
Actual results:
  # cat /tmp/mpi_outfile.0 
    error 0 chirp putting identity keys back
  # cat /tmp/mpi_errfile.0
    chirp: couldn't putfile: No such file or directory
    /usr/libexec/condor/sshd.sh: line 69:  3991 Aborted                 $CONDOR_CHIRP put -perm 0700 $idkey $_CONDOR_REMOTE_SPOOL_DIR/$_CONDOR_PROCNO.key

Expected results:
  No errors in the files mentioned above; the OpenMPI job launches correctly.

Additional info:
  The 0 printed as the error code in the output message is tracked in bug 759154.
  SELinux disallowing ssh key generation is tracked in bug 759403.

Am I doing anything wrong?

Comment 1 Daniel Horák 2011-12-02 11:35:45 UTC
After a little probing, it looks like condor_chirp does not accept an absolute path for the remote file.
If I change this line in /usr/libexec/condor/sshd.sh (around line 69):
  $CONDOR_CHIRP put -perm 0700 $idkey $_CONDOR_REMOTE_SPOOL_DIR/$_CONDOR_PROCNO.key
to:
  $CONDOR_CHIRP put -perm 0700 $idkey $_CONDOR_PROCNO.key
the key is correctly put on the central manager machine (in /var/lib/condor/0.key).
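
The edit described in this comment could be wrapped in a small shell function. This is only a sketch of the workaround, not the fix that shipped in condor-7.6.5-0.9. The name put_identity_key is hypothetical, and the status handling and the use of stderr are assumptions; CONDOR_CHIRP and _CONDOR_PROCNO are expected to be set the same way sshd.sh sets them.

```shell
#!/bin/sh
# Sketch of the workaround only; not the shipped sshd.sh.
# put_identity_key is a hypothetical helper name.
# CONDOR_CHIRP and _CONDOR_PROCNO are assumed to be set as in sshd.sh.
put_identity_key() {
    idkey="$1"

    # Use a relative remote name instead of
    # $_CONDOR_REMOTE_SPOOL_DIR/$_CONDOR_PROCNO.key, which chirp
    # rejected on the affected build.
    remote_name="${_CONDOR_PROCNO}.key"

    "$CONDOR_CHIRP" put -perm 0700 "$idkey" "$remote_name"
    status=$?

    # Capture the exit status right after the command so the message
    # reports the real code rather than the status of a later test.
    if [ "$status" -ne 0 ]; then
        echo "error $status chirp putting identity keys back" >&2
        return "$status"
    fi
    return 0
}
```

Called as `put_identity_key "$idkey"`, it passes chirp the same arguments as the modified line above and returns chirp's exit status to the caller.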

Comment 2 Daniel Horák 2011-12-02 12:58:39 UTC
Created attachment 539618 [details]
Configuration and OpenMPI job (to comment 0)

Comment 3 Timothy St. Clair 2011-12-12 19:44:06 UTC
Could you verify whether this still exists in condor-7.6.5-0.9?

This could be related to https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2630, the fix for which should be in the aforementioned build.

Comment 4 Daniel Horák 2011-12-13 08:51:19 UTC
On RHEL 5.7 i386 with condor-7.6.5-0.9.el5.i386 it is OK (ssh keys are correctly put on the CM).

Comment 5 Timothy St. Clair 2011-12-13 15:08:19 UTC
Fixed upstream.

Comment 7 Timothy St. Clair 2011-12-14 18:07:59 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Run an OpenMPI/parallel universe job
C: condor_chirp will fail to write file 
F: condor_chirp was using relative paths vs. absolute
R: Parallel universe jobs run to completion

Comment 9 Daniel Horák 2012-01-10 13:45:34 UTC
Verified on all platforms (RHEL 5.7 and RHEL 6.2, i386 and x86_64):
  - identity keys are correctly put back,
  - the output and error files contain no errors (relevant to this BZ).
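
A check like this one could be scripted. The following is a minimal sketch, assuming the two error strings quoted in the description are the only ones of interest; job_logs_clean is a hypothetical helper name, and its messages and return codes are illustrative.

```shell
#!/bin/sh
# Sketch only: scans job output/error files for the chirp failures
# reported in this bug. job_logs_clean is a hypothetical helper name.
job_logs_clean() {
    for f in "$@"; do
        # An unreadable file must not count as a clean result.
        if [ ! -r "$f" ]; then
            echo "cannot read $f" >&2
            return 2
        fi
        if grep -q -e "chirp putting identity keys back" \
                   -e "couldn't putfile" "$f"; then
            echo "chirp error found in $f"
            return 1
        fi
    done
    echo "no chirp errors found"
    return 0
}
```

For the job in this report it would be run as `job_logs_clean /tmp/mpi_outfile.0 /tmp/mpi_errfile.0`.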

Comment 10 errata-xmlrpc 2012-02-06 18:18:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html