Bug 549432 - Parallel Universe jobs require job spool directory
Summary: Parallel Universe jobs require job spool directory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assignee: Will Benton
QA Contact: Martin Kudlej
URL:
Whiteboard:
Duplicates: 538436
Depends On:
Blocks: 537232
 
Reported: 2009-12-21 17:11 UTC by Jon Thomas
Modified: 2018-11-14 17:58 UTC (History)
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, /usr/libexec/condor/sshd.sh's use of the condor_chirp daemon assumed the previous creation of a spool temp directory, which was not the desired behavior. With this update, 'sshd.sh' no longer requires a spool directory.
Clone Of:
Environment:
Last Closed: 2010-10-14 16:15:34 UTC
Target Upstream Version:
Embargoed:


Attachments
stderr output (677.41 KB, text/plain)
2009-12-21 17:13 UTC, Jon Thomas
error file (46.25 KB, application/octet-stream)
2010-09-17 10:29 UTC, Martin Kudlej


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 15:56:44 UTC

Description Jon Thomas 2009-12-21 17:11:42 UTC
Problems related to /usr/libexec/condor/sshd.sh from condor-7.4.1-0.7.el5.

Problem #1 - sshd.sh's use of condor_chirp assumes the prior creation of a spool temp directory.  In our shared NFS environment, we've worked with the folks involved to prevent the schedd from creating those temp directories.

Problem #2 - I tested anyway, working around problem #1 with "condor_submit -spool", and the job ran into further problems.  I've attached the spooled _condor_stderr file from the schedd area; if you look at line 103 you'll see that condor_chirp is getting a \001 character appended to the returned ClassAd value of EnteredCurrentStatus.  This causes the next condor_chirp put to fail, leaving the job in a hanging-like state.
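For illustration only, a minimal workaround sketch (not the shipped fix), assuming the stray byte shows up as a trailing control character on the condor_chirp output; the status variable is hypothetical:

# strip control characters from the value condor_chirp returns,
# so a stray \001 cannot poison the next condor_chirp call
status=$(condor_chirp get_job_attr EnteredCurrentStatus | tr -d '[:cntrl:]')
echo "EnteredCurrentStatus=$status"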

This seems to be related to bz 538436. Not sure if it qualifies as a dup.

Comment 1 Jon Thomas 2009-12-21 17:13:01 UTC
Created attachment 379657 [details]
stderr output

Comment 4 Matthew Farrellee 2010-04-12 22:10:46 UTC
*** Bug 538436 has been marked as a duplicate of this bug. ***

Comment 5 Matthew Farrellee 2010-06-21 11:33:11 UTC
Built in 7.4.3-0.21

Comment 6 Martin Kudlej 2010-08-02 12:26:41 UTC
How can I reproduce this bug? Could you please paste a test case or a recipe for reproducing this issue here?

Comment 7 Issue Tracker 2010-08-02 14:03:18 UTC
Event posted on 08-02-2010 10:03am EDT by tgummels

You should be able to reproduce this with the data I'll attach and the following command line:

condor_submit -spool test.sdf



This event sent from IssueTracker by tgummels 
 issue 371984

Comment 11 Martin Kudlej 2010-09-17 10:28:27 UTC
I've tested this on RHEL 5.5 x86_64 with condor-7.4.4-0.14.el5 and I've got the same error in the stderr output as in comment 1.

1. install condor with default configuration
2. set up dedicated scheduler (a sketch for applying and checking this configuration follows the recipe below):
DedicatedScheduler = "DedicatedScheduler@--hostname--"
START           = Scheduler =?= $(DedicatedScheduler)
SUSPEND = False
CONTINUE        = True
PREEMPT = False
KILL            = False
WANT_SUSPEND    = False
WANT_VACATE     = False
RANK            = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

3. submit a parallel job on one machine without transferring files (I've tried to simulate a shared file system by not transferring files):
$cat mpi.sub:
universe = parallel
executable = /home/xxx/openmpi/ompiscript
arguments = summpi
log = logfile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
getenv = true
environment = LD_LIBRARY_PATH=/usr/lib64/openmpi/1.4-gcc/
queue

$condor_submit -spool mpi.sub

I've got ompiscript and other settings from https://bugzilla.redhat.com/show_bug.cgi?id=537232#c2
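
For reference, a minimal sketch of applying the scheduler settings from step 2 and then pulling back the spooled output; the local config path and the dedicated_scheduler.conf file name are assumptions, and <cluster_id> is a placeholder:

# append the settings from step 2 to the local config and reload
$cat dedicated_scheduler.conf >> /etc/condor/condor_config.local
$condor_reconfig
# confirm the startd now advertises the DedicatedScheduler attribute
$condor_status -format "%s\n" DedicatedScheduler
# after the -spool submission, watch the job and fetch its spooled output
$condor_q
$condor_transfer_data <cluster_id>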

Am I doing anything wrong, or is the bug still there?

Comment 12 Martin Kudlej 2010-09-17 10:29:28 UTC
Created attachment 447967 [details]
error file

Comment 13 Will Benton 2010-09-20 21:25:08 UTC
Martin, I'm not able to reproduce your failure case here, either with -spool or without.  One thing to check is to make sure all of the paths in ompiscript are set properly (and that the scripts and Makefile aren't accidentally pulling down tools from an mpich installation).

Some other things to consider:  Is the spool directory getting created, or not?  What attributes are set in the job ad?
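
A quick sketch of how to check both (the cluster id 1 below is just a placeholder):

# where the schedd keeps spooled files, and whether a per-job directory appeared
condor_config_val SPOOL
ls -lR $(condor_config_val SPOOL)
# dump the job ad to see which attributes were actually set
condor_q -long 1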

Comment 14 Martin Kudlej 2010-09-22 09:39:53 UTC
I don't know where the problem was, but now it works. I've used a clean Condor installation.

Tested with condor-7.4.1-0.7 and it doesn't work.

Tested with condor-7.4.4-0.14 on RHEL 5.5/4.8 x i386/x86_64 and it works. --> VERIFIED

Comment 15 Martin Prpič 2010-10-07 15:31:09 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, /usr/libexec/condor/sshd.sh's use of the condor_chirp daemon assumed the previous creation of a spool temp directory, which was not the desired behavior. With this update, 'sshd.sh' no longer requires a spool directory.

Comment 17 errata-xmlrpc 2010-10-14 16:15:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

