Problems related to /usr/libexec/condor/sshd.sh from condor-7.4.1-0.7.el5.

Problem #1 - sshd.sh's use of condor_chirp assumes that a spool temp directory has already been created. In our shared NFS environment, we've worked with the folks to prevent the schedd from making those temp directories.

Problem #2 - I tested anyway, working around problem #1 with "condor_submit -spool", and the job ran into further problems. I've attached the spooled _condor_stderr file from the schedd area; if you look at line 103 you'll see that condor_chirp is getting a \001 character appended to the returned ClassAd value of EnteredCurrentStatus. This causes the next condor_chirp put to fail, leaving the job in a hung state.

This seems to be related to bz 538436. Not sure if it qualifies as a dup.
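For illustration only (this snippet is not taken from sshd.sh, and the attribute value "2" is made up for the demo), here is a minimal shell sketch of the symptom and of stripping the stray byte with tr, the kind of workaround one might try in the script:

```shell
#!/bin/sh
# Simulate a ClassAd value that comes back with a stray \001 (SOH) byte
# appended, as seen on line 103 of the attached _condor_stderr.
# "2" stands in for a hypothetical EnteredCurrentStatus value.
raw=$(printf '2\001')

# Strip the control byte before reusing the value in a later command.
clean=$(printf '%s' "$raw" | tr -d '\001')

echo "clean=$clean"
```

Running this prints `clean=2`; without the tr filter, the trailing \001 would ride along into whatever command consumes the value next, which is consistent with the subsequent condor_chirp put failing.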
Created attachment 379657 [details] stderr output
*** Bug 538436 has been marked as a duplicate of this bug. ***
Built in 7.4.3-0.21
How can I reproduce this bug? Could you paste a test case or recipe for reproducing this issue here, please?
Event posted on 08-02-2010 10:03am EDT by tgummels:

You should be able to reproduce with the data I'll attach and the following command line:

    condor_submit -spool test.sdf

This event sent from IssueTracker by tgummels, issue 371984
I've tested this on RHEL 5.5 x86_64 with condor-7.4.4-0.14.el5 and I get the same error in the stderr output as in comment 1.

1. Install condor with the default configuration.

2. Set up a dedicated scheduler:

    DedicatedScheduler = "DedicatedScheduler@--hostname--"
    START = Scheduler =?= $(DedicatedScheduler)
    SUSPEND = False
    CONTINUE = True
    PREEMPT = False
    KILL = False
    WANT_SUSPEND = False
    WANT_VACATE = False
    RANK = Scheduler =?= $(DedicatedScheduler)
    MPI_CONDOR_RSH_PATH = $(LIBEXEC)
    CONDOR_SSHD = /usr/sbin/sshd
    CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
    STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

3. Submit a parallel job on one machine without transferring files (I tried to simulate a shared file system by not transferring files):

    $ cat mpi.sub
    universe = parallel
    executable = /home/xxx/openmpi/ompiscript
    arguments = summpi
    log = logfile.$(NODE)
    output = outfile.$(NODE)
    error = errfile.$(NODE)
    machine_count = 1
    getenv = true
    environment = LD_LIBRARY_PATH=/usr/lib64/openmpi/1.4-gcc/
    queue

    $ condor_submit -spool mpi.sub

I got ompiscript and the other settings from https://bugzilla.redhat.com/show_bug.cgi?id=537232#c2

Am I doing anything wrong, or is that bug still there?
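As a quick sanity check while reproducing, you can grep the job's error file for a raw \001 byte; a hit means the symptom from comment 0 is present. The file name and line content below are simulated for the demo, not taken from the actual attachment:

```shell
#!/bin/sh
# Create a temp file standing in for errfile.0, containing a line with
# the stray \001 byte (hypothetical content modeled on the report).
demo=$(mktemp)
printf 'EnteredCurrentStatus = 2\001\n' > "$demo"

# grep for a literal SOH byte; printf '\001' produces the raw byte.
if grep -q "$(printf '\001')" "$demo"; then
    echo "stray control byte present"
fi
rm -f "$demo"
```

On a fixed build, the same grep against the real error file should find nothing.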
Created attachment 447967 [details] error file
Martin, I'm not able to reproduce your failure case here, either with -spool or without. One thing to check is to make sure all of the paths in ompiscript are set properly (and that the scripts and Makefile aren't accidentally pulling down tools from an mpich installation). Some other things to consider: Is the spool directory getting created, or not? What attributes are set in the job ad?
I don't know where the problem was, but now it works. I used a clean Condor installation. Tested with condor-7.4.1-0.7: it does not work. Tested with condor-7.4.4-0.14 on RHEL 5.5/4.8, i386 and x86_64: it works. --> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, the use of the condor_chirp tool by /usr/libexec/condor/sshd.sh assumed that a spool temp directory had already been created, which was not the desired behavior. With this update, 'sshd.sh' no longer requires a spool directory.
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html