Bug 967010 - Failed to open file in SPOOL on Execute node
Summary: Failed to open file in SPOOL on Execute node
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: grid-maint-list
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-05-24 14:01 UTC by Martin Kudlej
Modified: 2016-05-26 19:29 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-26 19:29:32 UTC
Target Upstream Version:
Embargoed:



Description Martin Kudlej 2013-05-24 14:01:39 UTC
Description of problem:
I have this pool:

1st node (A = 32-bit RHEL 5.9): CM, Scheduler, Execute node
Other nodes (B = 64-bit RHEL 5.9, C = 32-bit RHEL 5.9): Execute node only
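
For reference, this role split corresponds to condor_config settings roughly like the following (a sketch only; the exact daemon lists on my nodes are not copied here):

    # node A - central manager, scheduler and execute node
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

    # nodes B and C - execute node only
    DAEMON_LIST = MASTER, STARTD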

I submit this job (x.sub):
    universe=vanilla
    cmd=/bin/pwd
    output=out$(PROCESS).txt
    transfer_executable=false
    should_transfer_files=if_needed
    when_to_transfer_output=on_exit
    requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
    queue

from node B to the Scheduler on node A with this command:
$ condor_submit -remote A x.sub

My goal is to submit the job remotely from an Execute-only node to the CM+Scheduler without transferring the executable (because /bin/pwd exists on all machines) and to transfer the std* files only if needed.

Jobs work OK on nodes A and C, but they fail on node B.
There is an error in the StarterLog:

05/24/13 14:30:19 Starting a VANILLA universe job with ID: 71.9
05/24/13 14:30:19 IWD: /var/lib/condor/spool/71/9/cluster71.proc9.subproc0
05/24/13 14:30:19 Failed to open '/var/lib/condor/spool/71/9/cluster71.proc9.subproc0/out9.txt' as standard output: No such file or directory (errno 2)
05/24/13 14:30:19 Failed to open some/all of the std files...
05/24/13 14:30:19 Aborting OsProc::StartJob.
05/24/13 14:30:19 Failed to start job, exiting
05/24/13 14:30:19 ShutdownFast all jobs.
05/24/13 14:30:19 condor_read() failed: recv(fd=6) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <ip:33251>.
05/24/13 14:30:19 IO: Failed to read packet header

If I add RemoteIwd to the job ClassAd, the job works on node B as well.
I think it should work on node B even WITHOUT RemoteIwd, because the Starter there can detect that there is no SPOOL directory on that machine, since it is an Execute node only.
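
The workaround is a single extra line in the submit file, something like this (the path is only an example; any directory that exists on node B works):

    +RemoteIwd = "/tmp"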

Note: I've tried setting up SSL authentication in the pool so that the same authenticated users exist on all machines. I've also tried setting the same UidDomain for all nodes in the pool. Neither of these experiments helped to solve the problem.
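
By "same UidDomain" I mean an ordinary condor_config setting applied to every node, roughly like this (example.com is a placeholder for the real domain):

    UID_DOMAIN = example.com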

Version-Release number of selected component (if applicable):
condor-7.8.8-0.4.1.el5
condor-aviary-7.8.8-0.4.1.el5
condor-classads-7.8.8-0.4.1.el5
condor-job-hooks-1.5-6.el5
condor-low-latency-1.2-3.el5
condor-qmf-7.8.8-0.4.1.el5
condor-wallaby-base-db-1.25-1.el5
condor-wallaby-client-5.0.5-2.el5
condor-wallaby-tools-5.0.5-2.el5
python-condorutils-1.5-6.el5
python-qpid-0.18-4.el5
python-qpid-qmf-0.18-15.el5
python-wallabyclient-5.0.5-2.el5
qpid-cpp-client-0.18-14.el5
qpid-cpp-client-devel-0.18-14.el5
qpid-cpp-server-0.18-14.el5
qpid-qmf-0.18-15.el5
qpid-qmf-devel-0.18-15.el5
qpid-tools-0.18-8.el5
ruby-condor-wallaby-5.0.5-2.el5
ruby-qpid-qmf-0.18-15.el5
ruby-wallaby-0.16.3-1.el5
wallaby-0.16.3-1.el5
wallaby-utils-0.16.3-1.el5


How reproducible:
100%

Actual results:
Jobs don't run on node B. 

Expected results:
Jobs run on all nodes in the pool with the job ClassAds described above.

Comment 4 Anne-Louise Tangring 2016-05-26 19:29:32 UTC
MRG-Grid is in maintenance and only customer escalations will be considered. This issue can be reopened if a customer escalation associated with it occurs.

