Bug 549432 - Parallel Universe jobs require job spool directory
Summary: Parallel Universe jobs require job spool directory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assignee: Will Benton
QA Contact: Martin Kudlej
URL:
Whiteboard:
Duplicates: 538436
Depends On:
Blocks: 537232
 
Reported: 2009-12-21 17:11 UTC by Jon Thomas
Modified: 2018-11-14 17:58 UTC (History)
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, /usr/libexec/condor/sshd.sh's use of the condor_chirp daemon assumed the previous creation of a spool temp directory, which was not the desired behavior. With this update, 'sshd.sh' no longer requires a spool directory.
Clone Of:
Environment:
Last Closed: 2010-10-14 16:15:34 UTC
Target Upstream Version:
Embargoed:


Attachments
stderr output (677.41 KB, text/plain)
2009-12-21 17:13 UTC, Jon Thomas
error file (46.25 KB, application/octet-stream)
2010-09-17 10:29 UTC, Martin Kudlej


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 15:56:44 UTC

Description Jon Thomas 2009-12-21 17:11:42 UTC
Problems related to /usr/libexec/condor/sshd.sh from condor-7.4.1-0.7.el5.

Problem #1 - sshd.sh's use of condor_chirp assumes the prior creation of a spool temp directory.  In our shared NFS environment, we've worked with the folks involved to prevent the schedd from creating those temp directories.

Problem #2 - I tested anyway, working around problem #1 with "condor_submit -spool", and the job ran into further problems.  I've attached the spooled _condor_stderr file from the schedd area; if you look at line 103 you'll see that condor_chirp is getting a \001 character appended to the returned ClassAd value of EnteredCurrentStatus.  This causes the next condor_chirp put to fail, leaving the job in a hanging-like state.
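For illustration only, a minimal workaround sketch (not the shipped fix), assuming the stray byte shows up as a trailing control character on the condor_chirp output; the status variable is hypothetical:

# strip control characters from the value condor_chirp returns,
# so a stray \001 cannot poison the next condor_chirp call
status=$(condor_chirp get_job_attr EnteredCurrentStatus | tr -d '[:cntrl:]')
echo "EnteredCurrentStatus=$status"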

This seems to be related to bz 538436. Not sure if it qualifies as a dup.

Comment 1 Jon Thomas 2009-12-21 17:13:01 UTC
Created attachment 379657 [details]
stderr output

Comment 4 Matthew Farrellee 2010-04-12 22:10:46 UTC
*** Bug 538436 has been marked as a duplicate of this bug. ***

Comment 5 Matthew Farrellee 2010-06-21 11:33:11 UTC
Built in 7.4.3-0.21

Comment 6 Martin Kudlej 2010-08-02 12:26:41 UTC
How can I reproduce this bug? Could you please paste a test case or a recipe for reproducing this issue here?

Comment 7 Issue Tracker 2010-08-02 14:03:18 UTC
Event posted on 08-02-2010 10:03am EDT by tgummels

You should be able to reproduce this with the data I'll attach and the following command line:

condor_submit -spool test.sdf



This event sent from IssueTracker by tgummels 
 issue 371984

Comment 11 Martin Kudlej 2010-09-17 10:28:27 UTC
I've tested this on RHEL 5.5 x86_64 with condor-7.4.4-0.14.el5 and I've got the same error in the stderr output as in comment 1.

1. install condor with default configuration
2. set up dedicated scheduler (a sketch for applying and checking this configuration follows the recipe below):
DedicatedScheduler = "DedicatedScheduler@--hostname--"
START           = Scheduler =?= $(DedicatedScheduler)
SUSPEND = False
CONTINUE        = True
PREEMPT = False
KILL            = False
WANT_SUSPEND    = False
WANT_VACATE     = False
RANK            = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

3. submit a parallel job on one machine without transferring files (I've tried to simulate a shared file system by not transferring files):
$cat mpi.sub:
universe = parallel
executable = /home/xxx/openmpi/ompiscript
arguments = summpi
log = logfile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
getenv = true
environment = LD_LIBRARY_PATH=/usr/lib64/openmpi/1.4-gcc/
queue

$condor_submit -spool mpi.sub

I've got ompiscript and other settings from https://bugzilla.redhat.com/show_bug.cgi?id=537232#c2
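
For reference, a minimal sketch of applying the scheduler settings from step 2 and then pulling back the spooled output; the local config path and the dedicated_scheduler.conf file name are assumptions, and <cluster_id> is a placeholder:

# append the settings from step 2 to the local config and reload
$cat dedicated_scheduler.conf >> /etc/condor/condor_config.local
$condor_reconfig
# confirm the startd now advertises the DedicatedScheduler attribute
$condor_status -format "%s\n" DedicatedScheduler
# after the -spool submission, watch the job and fetch its spooled output
$condor_q
$condor_transfer_data <cluster_id>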

Am I doing anything wrong, or is the bug still there?

Comment 12 Martin Kudlej 2010-09-17 10:29:28 UTC
Created attachment 447967 [details]
error file

Comment 13 Will Benton 2010-09-20 21:25:08 UTC
Martin, I'm not able to reproduce your failure case here, either with -spool or without.  One thing to check is to make sure all of the paths in ompiscript are set properly (and that the scripts and Makefile aren't accidentally pulling down tools from an mpich installation).

Some other things to consider:  Is the spool directory getting created, or not?  What attributes are set in the job ad?
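
A quick sketch of how to check both (the cluster id 1 below is just a placeholder):

# where the schedd keeps spooled files, and whether a per-job directory appeared
condor_config_val SPOOL
ls -lR $(condor_config_val SPOOL)
# dump the job ad to see which attributes were actually set
condor_q -long 1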

Comment 14 Martin Kudlej 2010-09-22 09:39:53 UTC
I don't know where the problem was, but now it works. I've used a clean Condor installation.

Tested with condor-7.4.1-0.7 and it doesn't work.

Tested with condor-7.4.4-0.14 on RHEL 5.5/4.8 x i386/x86_64 and it works. --> VERIFIED

Comment 15 Martin Prpič 2010-10-07 15:31:09 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, /usr/libexec/condor/sshd.sh's use of the condor_chirp daemon assumed the previous creation of a spool temp directory, which was not the desired behavior. With this update, 'sshd.sh' no longer requires a spool directory.

Comment 17 errata-xmlrpc 2010-10-14 16:15:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

