Bug 967010 - Failed to open file in SPOOL on Execute node
Status: CLOSED WONTFIX
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: grid-maint-list
QA Contact: MRG Quality Engineering
Depends On:
Blocks:
Reported: 2013-05-24 10:01 EDT by Martin Kudlej
Modified: 2016-05-26 15:29 EDT (History)
CC List: 1 user

Doc Type: Bug Fix
Last Closed: 2016-05-26 15:29:32 EDT
Type: Bug

Attachments: None
Description Martin Kudlej 2013-05-24 10:01:39 EDT
Description of problem:
I have this pool:

1st node(A=32bit RHEL5.9): CM, Scheduler, Execute node
Other nodes(B=64bit RHEL5.9, C=32bit RHEL5.9): Execute node only

I submit a job (x.sub):
    universe=vanilla
    cmd=/bin/pwd
    output=out$(PROCESS).txt
    transfer_executable=false
    should_transfer_files=if_needed
    when_to_transfer_output=on_exit
    requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
    queue

from node B to the Scheduler on node A with the command:
$ condor_submit -remote A x.sub

My goal is to submit the job remotely from an Execute-only node to the CM+Scheduler, without transferring the executable (because /bin/pwd exists on all machines) and transferring the std* files only if needed.

The job runs fine on nodes A and C, but it fails on node B.
The StarterLog on node B shows this error:

05/24/13 14:30:19 Starting a VANILLA universe job with ID: 71.9
05/24/13 14:30:19 IWD: /var/lib/condor/spool/71/9/cluster71.proc9.subproc0
05/24/13 14:30:19 Failed to open '/var/lib/condor/spool/71/9/cluster71.proc9.subproc0/out9.txt' as standard output: No such file or directory (errno 2)
05/24/13 14:30:19 Failed to open some/all of the std files...
05/24/13 14:30:19 Aborting OsProc::StartJob.
05/24/13 14:30:19 Failed to start job, exiting
05/24/13 14:30:19 ShutdownFast all jobs.
05/24/13 14:30:19 condor_read() failed: recv(fd=6) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <ip:33251>.
05/24/13 14:30:19 IO: Failed to read packet header

If I add RemoteIwd to the job ClassAd, the job also works on node B.
I think it should work on node B even WITHOUT RemoteIwd, because the Starter there can detect that the machine has no SPOOL directory, since it is an Execute-only node.
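For illustration, the RemoteIwd workaround above can be expressed directly in the submit description with a custom ClassAd attribute (a hedged sketch; the directory /tmp is only a hypothetical example, not taken from this report, and any directory that exists on every execute node would do):

    # Same submit description as above, plus an explicit RemoteIwd
    # injected into the job ClassAd via the "+Attribute" syntax.
    universe = vanilla
    cmd = /bin/pwd
    output = out$(PROCESS).txt
    transfer_executable = false
    should_transfer_files = if_needed
    when_to_transfer_output = on_exit
    requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
    +RemoteIwd = "/tmp"
    queue

With this attribute set, the Starter uses the given directory as the job's working directory instead of assuming a SPOOL path that may not exist on an Execute-only node.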

Note: I've tried setting up SSL authentication in the pool so that the same authenticated users exist on all machines. I've also tried setting the same UidDomain for all nodes in the pool. Neither experiment helped solve the problem.

Version-Release number of selected component (if applicable):
condor-7.8.8-0.4.1.el5
condor-aviary-7.8.8-0.4.1.el5
condor-classads-7.8.8-0.4.1.el5
condor-job-hooks-1.5-6.el5
condor-low-latency-1.2-3.el5
condor-qmf-7.8.8-0.4.1.el5
condor-wallaby-base-db-1.25-1.el5
condor-wallaby-client-5.0.5-2.el5
condor-wallaby-tools-5.0.5-2.el5
python-condorutils-1.5-6.el5
python-qpid-0.18-4.el5
python-qpid-qmf-0.18-15.el5
python-wallabyclient-5.0.5-2.el5
qpid-cpp-client-0.18-14.el5
qpid-cpp-client-devel-0.18-14.el5
qpid-cpp-server-0.18-14.el5
qpid-qmf-0.18-15.el5
qpid-qmf-devel-0.18-15.el5
qpid-tools-0.18-8.el5
ruby-condor-wallaby-5.0.5-2.el5
ruby-qpid-qmf-0.18-15.el5
ruby-wallaby-0.16.3-1.el5
wallaby-0.16.3-1.el5
wallaby-utils-0.16.3-1.el5


How reproducible:
100%

Actual results:
Jobs don't run on node B.

Expected results:
Jobs run on all nodes in the pool with the job ClassAd described above.
Comment 4 Anne-Louise Tangring 2016-05-26 15:29:32 EDT
MRG-Grid is in maintenance and only customer escalations will be considered. This issue can be reopened if a customer escalation associated with it occurs.
