Red Hat Bugzilla – Bug 572574
Error reported from execute node incomplete for IWD access failure
Last modified: 2010-10-14 12:06:46 EDT
Description of problem:
A failure to access a job's IWD during execution reports a confusing error message in the job's HoldReason.

Version-Release number of selected component (if applicable):
All up to and including condor 7.4.3-0.4

How reproducible:
100%

Steps to Reproduce:
1. 2 machine setup: 1) schedd, 2) startd
2. mkdir /tmp/wontexist
3. echo -e "cmd=/bin/true\niwd=/tmp/wontexist\nrequirements=Machine=!=\"$HOSTNAME\"\nqueue" | condor_submit
4. let the job go to H[eld] in condor_q
5. condor_q -l | grep ^HoldReason and observe:
   "Error from slot1@startd-machine: Failed to execute '/bin/true': No such file or directory"
6. on startd machine look in /var/log/condor/StarterLog.slot1 and observe:
   Create_Process: Cannot access specified cwd "/tmp/wontexist": errno = 2 (No such file or directory)
   ERROR "Create_Process(/bin/true,, ...) failed: No such file or directory" at line 530 in file os_proc.cpp

Expected results:
The HoldReason to include information about access to cwd (say iwd!) failing.
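To illustrate the expected behaviour, here is a minimal sketch in Python (not Condor's C++ code, and not the fix that was pushed for this bug). It assumes a hypothetical helper, hold_reason, that probes the working directory before launching the command, so that a missing cwd is reported as such instead of being folded into a "Failed to execute" message. The message wording is illustrative only.

```python
import os
import subprocess


def hold_reason(cmd, cwd):
    """Return None if cmd could be started in cwd, else a message
    that names the path which actually failed.

    Hypothetical helper for illustration; not part of Condor.
    """
    # Probe the working directory first, so a cwd failure is not
    # reported as if the executable itself were missing.
    try:
        os.stat(cwd)
    except OSError as e:
        return "Cannot access initial working directory '%s': %s" % (cwd, e.strerror)

    # Only now attempt the launch; an OSError raised here is
    # attributed to the executable.
    try:
        subprocess.run([cmd], cwd=cwd, check=False)
    except OSError as e:
        return "Failed to execute '%s': %s" % (cmd, e.strerror)

    return None


if __name__ == "__main__":
    # Mirrors the reproducer: the executable exists, the cwd does not.
    print(hold_reason("/bin/true", "/tmp/wontexist"))
```

The stat-then-launch order leaves a small race window (the directory could vanish between the two calls), so this is only a sketch of how the two failure cases can be told apart in the reported message.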
A search of condor-wiki found #1015, which is related. http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1015 "2. The error message should be improved."
pushed candidate fix to branch V7_4-BZ572574-misleading-create-process-iwd-err-msg
For the test setup (my local machine, with a reserved lab machine as the execute node), I ended up needing the following command for submission:

% echo -e "cmd=/bin/true\nremoteiwd=/tmp\nshould_transfer_files=true\nwhen_to_transfer_output=ON_EXIT\ntransfer_executable=true\nqueue" | condor_submit

I also used the following edits to the local config on the execute machine:

CONDOR_HOST = <IP-of-my-local-machine-via-vpn>
CCB_ADDRESS = $(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = "network_name"
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
Pushed an alternative fix based on MyString: V7_4-BZ572574-iwd-err-msg-MyString
See also: gt#1361
The HoldReason now explicitly says that the directory specified as Iwd does not exist (see #1 and #3). Verified on RHEL 4.8/5.5, i386/x86_64. condor-7.4.3-0.21
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, confusing error messages were printed when accessing a job's IWD failed during execution. The messages are corrected and the issue is resolved.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html