Bug 608027

Summary: cannot transfer files back to submit machine
Product: Red Hat Enterprise MRG
Component: condor
Version: 1.3
Hardware: All
OS: Windows
Status: CLOSED DEFERRED
Severity: medium
Priority: high
Target Milestone: 1.3.2
Fixed In Version: condor-7.4.5-0.2
Doc Type: Bug Fix
Last Closed: 2011-02-15 15:23:05 UTC
Reporter: Martin Kudlej <mkudlej>
Assignee: Timothy St. Clair <tstclair>
QA Contact: Martin Kudlej <mkudlej>
CC: alyoung, matt, tstclair

Attachments:
ls output, SchedLog, job description file, job executable, condor_transfer_data output

Description Martin Kudlej 2010-06-25 12:47:23 UTC
Description of problem:
I've tried to transfer files back to the submit machine as described in BZ607938, and I cannot do it because of an authentication problem.
Condor on the submit machine is set to CLAIMTOBE authentication, and the CM+Scheduler is also set to CLAIMTOBE. Error:
$ condor_transfer_data.exe -name <schedd name> <clusterid>
Fetching data files...
AUTHENTICATE: 1002:Failure performing handshake
ERROR: Failed to spool job files

The same output appears if I also pass the -pool argument.

Please see configuration files in BZ607938.

Comment 1 Martin Kudlej 2010-06-25 13:04:49 UTC
Just for the record, I've submitted the test job from BZ607938 without "+LeaveInjobQueue=False".

Comment 2 Matthew Farrellee 2010-06-27 12:40:51 UTC
Please enable D_SECURITY for the Schedd (SCHEDD_DEBUG) and for condor_transfer_data (via TOOL_DEBUG), and check the SchedLog for information as to why the connection is being DENIED.
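
For reference, a minimal sketch of those settings (condor_config.local is an assumed location; any configuration file the respective daemon or tool reads will do):

  # on the schedd host
  SCHEDD_DEBUG = D_SECURITY

  # on the machine running condor_transfer_data
  TOOL_DEBUG = D_SECURITY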

Comment 3 Martin Kudlej 2010-06-28 13:45:19 UTC
I've set SCHEDD_DEBUG = D_SECURITY on the CM and TOOL_DEBUG = D_ALL on both machines.
The attachment contains the ls output from the cluster directory, the SchedLog, the condor_transfer_data output, the job description file, and the job executable.

Comment 4 Martin Kudlej 2010-06-28 13:46:27 UTC
Created attachment 427424 [details]
ls output, SchedLog, job description file, job executable, condor_transfer_data output

Comment 5 Martin Kudlej 2010-06-29 08:55:01 UTC
If I submit the same job from a Linux machine, I can transfer the files back to the submit machine.

Comment 6 Timothy St. Clair 2010-06-29 15:06:33 UTC
Set
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
in your condor_config.local on Windows.

Also, set aside a third-party location such as C:\Temp (not windows\temp) for your files and verify that the submitting user has write permissions to that location.

There are several issues with capturing stdout & stderr on Windows, so you may want to have your bat file write its output to a txt file and transfer it ON_EXIT; see the sketch below.

Under these conditions all is well in my environment.

Please attempt to reproduce the error with the settings listed above.
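
A minimal sketch of that setup, assuming hypothetical file names (job.bat, job_output.txt) and a hypothetical payload command:

condor_config.local on the Windows submit machine:

  SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE

job.bat (hypothetical payload; writes all output to a plain text file in the job's working directory):

  @echo off
  myprogram.exe > job_output.txt 2>&1

Submit description file (hypothetical names; transfers the txt file back when the job exits):

  universe                = vanilla
  executable              = job.bat
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_output_files   = job_output.txt
  log                     = job.log
  queue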

Comment 7 Timothy St. Clair 2010-06-29 16:00:17 UTC
When traversing your logs I also noticed that you submitted as "Administrator", which is not a default account on your Linux box. You will want to create an account on your Windows machine with the same name/credentials as on your Linux schedd. When using CLAIMTOBE it will try to run as ClassAd[Owner] on your Linux box, and if that account doesn't exist it will FAIL to write to the spool directories due to permission issues.
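
Either direction closes the name mismatch; as comment 9 below confirms, creating the matching account on the Linux CM+Scheduler also works. A minimal sketch of that (whether useradd accepts the mixed-case name depends on the distribution):

  # on the Linux CM+Scheduler host, as root
  useradd Administrator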

Comment 8 Timothy St. Clair 2010-06-29 16:53:57 UTC
One final note:  I do see a failure to transfer job files if the run is still in progress (which I consider normal).  Once the jobs have been marked "C" the transfer should succeed.

Comment 9 Martin Kudlej 2010-06-30 08:57:48 UTC
I've always tried to transfer the files back only after the jobs are completed.
Both
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
SEC_CLIENT_AUTHENTICATION_METHODS = CLAIMTOBE
are set on the Windows machine and also on the Linux CM+Scheduler machine.

Timothy wrote:
You will want to create an
account on your windows machine with the same name/credz on your Linux schedd. 
When using CLAIMTOBE it will try to run as Classad[Owner] on your Linux box and
if that account doesn't exist then it will FAIL to write to the spool
directories due to permissions issues.

The job starts, runs, and is able to write to the spool directory on the Linux CM. The ls output in the attachment shows that the stdout for that job in the spool directory on Linux is not empty.
I run all tools as Administrator, so there should not be any problem with rights on the Windows side. I've also tried c:\temp shared with full rights for everyone, and it doesn't work.

If I create the user Administrator on the Linux CM+Scheduler machine, it works.

But I think that if authentication is disabled and the user Administrator can submit and run jobs on some machines in the pool, then Administrator should be able to transfer the results from the CM+Scheduler back to the machine from which Administrator submitted the job, without needing an account on the CM.
As far as I can see in the log file, it is just a question of whether a user with the same name exists on the CM as on the remote submit machine. It doesn't depend on authentication:
DC_AUTHENTICATE: received DC_AUTHENTICATE from <:3273>
DC_AUTHENTICATE: added incoming session id :23402:1277886744:4571 to cache for 80 seconds (lease is 3620s, return address is unknown).
DC_AUTHENTICATE: Success.
PERMISSION GRANTED to unauthenticated user from host  for command 489 (TRANSFER_DATA_WITH_PERMS), access level WRITE: reason: WRITE authorization policy allows access by anyone
HANDSHAKE: in handshake(my_methods = 'Claimtobe')
HANDSHAKE: handshake() - i am the server
HANDSHAKE: client sent (methods == 2)
HANDSHAKE: i picked (method == 2)
HANDSHAKE: client received (method == 2)
Authentication was a Success.
ZKM: setting default map to Administrator
ZKM: post-map: current user is 'Administrator'
ZKM: post-map: current domain is '(null)'
ZKM: post-map: current FQU is 'Administrator'
PERMISSION GRANTED to Administrator from host  for queue management, access level WRITE: reason: WRITE authorization policy allows access by anyone
PERMISSION GRANTED to Administrator from host  for queue management, access level WRITE: reason: WRITE authorization policy allows access by anyone
The submitting job ad as the FileTransferObject sees it
...
ReliSock: put_file: Failed to open file /var/lib/condor/spool/cluster330.proc0.subproc0/_condor_stderr, errno = 13.
DoUpload: (Condor error code 13, subcode 13) SCHEDD at 10.34.33.58 failed to send file(s) to <:3273>: error reading from /var/lib/condor/spool/cluster330.proc0.subproc0/_condor_stderr: (errno 13) Permission denied; TOOL failed to receive file(s) from <1:44461>
generalJobFilesWorkerThread(): failed to transfer files for job 330.0
ERROR - Staging of job files failed!
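
For reference, a quick way to check what the errno 13 (EACCES, permission denied) in the log points at, using the path from the log:

  # on the Linux CM+Scheduler
  ls -l /var/lib/condor/spool/cluster330.proc0.subproc0/
  # the schedd could not open _condor_stderr; compare the file's owner with
  # the user the transfer is mapped to above ('Administrator')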

Comment 10 Timothy St. Clair 2010-06-30 13:35:18 UTC
How are you submitting your jobs?

I submit with:

condor_submit -name mrg27.lab.bos.redhat.com -spool 

then, when the jobs are completed, run:

condor_transfer_data.exe -name <schedd name> <clusterid>

My data transfers are fine.
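
A minimal end-to-end sketch of that sequence from the submit machine (job.sub and <clusterid> are placeholders; waiting for the "C" state before transferring follows comment 8):

  # submit and spool the input files to the remote schedd
  condor_submit -name mrg27.lab.bos.redhat.com -spool job.sub

  # wait until the job shows up as completed ("C")
  condor_q -name mrg27.lab.bos.redhat.com <clusterid>

  # then fetch the output files back to this machine
  condor_transfer_data -name mrg27.lab.bos.redhat.com <clusterid>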

Comment 12 Timothy St. Clair 2010-06-30 14:05:42 UTC
Upstream tracking ticket: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1478

Comment 14 Timothy St. Clair 2010-12-01 18:56:41 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Fixed upstream in stable series.

Comment 15 Matthew Farrellee 2010-12-01 19:07:23 UTC
Deleted Technical Notes Contents.

Old Contents:
Fixed upstream in stable series.