Bug 549432 - Parallel Universe jobs require job spool directory
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Platform: All Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Assigned To: Will Benton
QA Contact: Martin Kudlej
Duplicates: 538436
Blocks: 537232
Reported: 2009-12-21 12:11 EST by Jon Thomas
Modified: 2010-10-14 12:15 EDT

Doc Type: Bug Fix
Doc Text:
Previously, the use of condor_chirp in /usr/libexec/condor/sshd.sh assumed that a temporary spool directory had already been created, which was not the desired behavior. With this update, sshd.sh no longer requires a spool directory.
Last Closed: 2010-10-14 12:15:34 EDT


Attachments
stderr output (677.41 KB, text/plain): 2009-12-21 12:13 EST, Jon Thomas
error file (46.25 KB, application/octet-stream): 2010-09-17 06:29 EDT, Martin Kudlej


External Trackers
Red Hat Product Errata RHSA-2010:0773 (normal, SHIPPED_LIVE): Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 (last updated 2010-10-14 11:56:44 EDT)

Description Jon Thomas 2009-12-21 12:11:42 EST
Problems related to /usr/libexec/condor/sshd.sh from condor-7.4.1-0.7.el5.

Problem #1 - sshd.sh's use of condor_chirp assumes that a spool temp directory has already been created. In our shared NFS environment, we've worked with the folks to prevent the schedd from making those temp directories.

Problem #2 - I worked around problem #1 and tested anyway with "condor_submit -spool", and the job ran into further problems. I've attached the spooled _condor_stderr file from the schedd area; if you look at line 103 you'll see that condor_chirp is getting a \001 character appended to the returned ClassAd value of EnteredCurrentStatus. This causes the next condor_chirp put to fail, leaving the job in a hanging-like state.
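For context, a minimal sketch of the chirp sequence described above. This is an illustration under assumptions, not the verbatim sshd.sh: the key filename and the use of $_CONDOR_REMOTE_SPOOL_DIR as the put target are placeholders, but get_job_attr and put are real condor_chirp subcommands.

# Hedged sketch (shell), not the actual sshd.sh contents.
# Read a job ad attribute back through the chirp channel:
status=$(condor_chirp get_job_attr EnteredCurrentStatus)
# In the failing case, $status comes back with a trailing \001 byte,
# and the next chirp transfer fails. The put below also presumes that
# a per-job spool directory already exists on the submit side:
condor_chirp put -perm 0700 sshd.key "$_CONDOR_REMOTE_SPOOL_DIR/sshd.key"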

This seems to be related to bz 538436. Not sure if it qualifies as a dup.
Comment 1 Jon Thomas 2009-12-21 12:13:01 EST
Created attachment 379657
stderr output
Comment 4 Matthew Farrellee 2010-04-12 18:10:46 EDT
*** Bug 538436 has been marked as a duplicate of this bug. ***
Comment 5 Matthew Farrellee 2010-06-21 07:33:11 EDT
Build in 7.4.3-0.21
Comment 6 Martin Kudlej 2010-08-02 08:26:41 EDT
How can I reproduce this bug? Could you please paste a test case or a recipe for reproducing this issue here?
Comment 7 Issue Tracker 2010-08-02 10:03:18 EDT
Event posted on 08-02-2010 10:03am EDT by tgummels

You should be able to reproduce it with the data I'll attach and the
following command line:

condor_submit -spool test.sdf

This event was sent from IssueTracker by tgummels (issue 371984).
Comment 11 Martin Kudlej 2010-09-17 06:28:27 EDT
I've tested this on RHEL 5.5 x86_64 with condor-7.4.4-0.14.el5 and I got the same error in the stderr output as in comment 1.

1. install condor with default configuration
2. set up a dedicated scheduler:
DedicatedScheduler = "DedicatedScheduler@--hostname--"
START = Scheduler =?= $(DedicatedScheduler)
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

3. submit a parallel job on one machine without file transfer (I tried to simulate a shared file system by not transferring files):
$cat mpi.sub
universe = parallel
executable = /home/xxx/openmpi/ompiscript
arguments = summpi
log = logfile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
getenv = true
environment = LD_LIBRARY_PATH=/usr/lib64/openmpi/1.4-gcc/
queue

$condor_submit -spool mpi.sub

I've got ompiscript and other settings from https://bugzilla.redhat.com/show_bug.cgi?id=537232#c2

Am I doing anything wrong, or is the bug still there?
Comment 12 Martin Kudlej 2010-09-17 06:29:28 EDT
Created attachment 447967
error file
Comment 13 Will Benton 2010-09-20 17:25:08 EDT
Martin, I'm not able to reproduce your failure case here, either with -spool or without.  One thing to check is to make sure all of the paths in ompiscript are set properly (and that the scripts and Makefile aren't accidentally pulling down tools from an mpich installation).

Some other things to consider:  Is the spool directory getting created, or not?  What attributes are set in the job ad?
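For reference, a rough sketch of how one could check both of those with standard Condor command-line tools (the job id 42.0 below is a placeholder for the submitted cluster):

# Inspect the job ad for relevant attributes (placeholder id 42.0):
condor_q -long 42.0 | grep -iE 'JobStatus|Iwd|ShouldTransferFiles'
# Check whether a per-job spool subdirectory was created
# (the directory layout varies between Condor versions):
ls -ld "$(condor_config_val SPOOL)"/42*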
Comment 14 Martin Kudlej 2010-09-22 05:39:53 EDT
I don't know where the problem was, but now it works. I've used a clean Condor installation.

Tested with condor-7.4.1-0.7: it does not work.

Tested with condor-7.4.4-0.14 on RHEL 5.5/4.8 on i386/x86_64: it works. --> VERIFIED
Comment 15 Martin Prpič 2010-10-07 11:31:09 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the use of condor_chirp in /usr/libexec/condor/sshd.sh assumed that a temporary spool directory had already been created, which was not the desired behavior. With this update, sshd.sh no longer requires a spool directory.
Comment 17 errata-xmlrpc 2010-10-14 12:15:34 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html
