Bug 712024

Summary: condor_submit and shadow don't detect "short-named" absolute Windows path
Product: Red Hat Enterprise MRG
Reporter: Martin Kudlej <mkudlej>
Component: condor
Assignee: Timothy St. Clair <tstclair>
Status: CLOSED WORKSFORME
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium
Priority: medium
Version: 2.0
CC: matt, tstclair
Target Milestone: 2.1
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-08-01 14:07:41 UTC
Bug Blocks: 692416

Description Martin Kudlej 2011-06-09 10:04:43 UTC
Description of problem:
submit file:
should_transfer_files=YES
executable='C:\mrg\ver.bat'
iwd='C:\users\admini~1.mku\appdata\local\temp'
log='c:\users\admini~1.mku\appdata\local\temp\mrg.log'
output='c:\users\admini~1.mku\appdata\local\temp\mrg.out'
universe=vanilla
arguments=1
error='c:\users\admini~1.mku\appdata\local\temp\mrg.err'
when_to_transfer_output=ON_EXIT
queue

Error message:
c:\mrg>C:\condor\bin\condor_submit.exe -name _scheduler_ _file_
Submitting job(s)
ERROR: No such directory: C:\mrg\'c:\users\admini~1.mku\appdata\local\temp\'

If I remove the single and double quotes, condor_submit submits the file without error, but the shadow then hits a segmentation fault:
06/09/11 04:04:09 Initializing a VANILLA shadow for job 359776.0
06/09/11 04:04:09 (359776.0) (18936): WriteUserLog::initialize: safe_open_wrapper("c:\users\admini~1.mku\appdata\local\temp/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log") failed - errno 2 (No such file or directory)
06/09/11 04:04:09 (359776.0) (18936): WriteUserLog::initialize: failed to open file
06/09/11 04:04:09 (359776.0) (18936): Failed to initialize user log to c:\users\admini~1.mku\appdata\local\temp/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
06/09/11 04:04:09 (359776.0) (18936): Job 359776.0 going into Hold state (code 22,0): Failed to initialize user log to c:\users\admini~1.mku\appdata\local\temp/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
06/09/11 04:04:09 (359776.0) (18936): RemoteResource::killStarter(): DCStartd object NULL!
Stack dump for process 18936 at timestamp 1307606649 (13 frames)
condor_shadow(dprintf_dump_stack+0x4a)[0x817b1da]
condor_shadow[0x814e4f6]
[0xe7e420]
condor_shadow(_ZN10BaseShadow16updateJobInQueueE8update_t+0xb2)[0x80b0442]
condor_shadow(_ZN10BaseShadow7holdJobEPKcii+0x12b)[0x80b22cb]
condor_shadow(_ZN10BaseShadow11initUserLogEv+0x131)[0x80b2561]
condor_shadow(_ZN10BaseShadow8baseInitEPN14compat_classad7ClassAdEPKcS4_+0x2f9)[0x80b34c9]
condor_shadow(_ZN9UniShadow4initEPN14compat_classad7ClassAdEPKcS4_+0x41)[0x80a6711]
condor_shadow(_Z10initShadowPN14compat_classad7ClassAdE+0xd0)[0x80aab60]
condor_shadow(_Z11startShadowPN14compat_classad7ClassAdE+0x68)[0x80aacb8]
condor_shadow(main+0x1162)[0x80eae12]
/lib/libc.so.6(__libc_start_main+0xdc)[0x5f1e9c]
condor_shadow[0x80a5f31]

The situation is similar for QMF jobs submitted from Windows:
$ condor_history -back -l 55159 | sort

args = "1"
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,LastPeriodicCheckpoint,RequestCpus,RequestDisk,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 15
ClusterId = 55159
cmd = "ver.bat"
CumulativeSlotTime = 7.000000
CurrentHosts = 0
CurrentTime = time()
EnteredCurrentStatus = 1307612747
Err = "c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.err"
GlobalJobId = "_scheduler_#55159.0#1307612629"
ImageSize = 0
ImageSize_RAW = 0
iwd = "c:\"
JobCurrentStartDate = 1307612746
JobFinishedHookDone = 1307612748
JobLastStartDate = 1307612745
JobPrio = 0
JobRunCount = 69
JobStartDate = 1307612637
JobStatus = 3
JobUniverse = 5
LastJobLeaseRenewal = 1307612746
LastJobStatus = 1
LastMatchTime = 1307612746
LastPublicClaimId = "<_ip_:1068>#1307501598#424#..."
LastRemoteHost = "slot2@mkudlej_windows2003_64"
LastSuspensionTime = 0
MachineAttrCpus0 = 1
MachineAttrSlotWeight0 = 1
MaxHosts = 1
MinHosts = 1
MyType = "Job"
NumJobMatches = 69
NumShadowStarts = 69
OrigMaxHosts = 1
Out = "c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.out"
owner = "condor"
ProcId = 0
QDate = 1307612629
RemoteUserCpu = 0.0
RemoteWallClockTime = 7.000000
RemoveReason = "via condor_rm (by user condor)"
requirements = ( FileSystemDomain =!= undefined ) && ( Arch =!= undefined ) && ( OpSys == "WINNT51" || OpSys == "WINNT52" || OpSys == "WINNT60" || OpSys == "WINNT61" )
ShouldTransferFiles = "NO"
should_transfer_files = "YES"
StartdPrincipal = "unauthenticated@unmapped/_ip_"
Submission = "_scheduler_#55159"
TargetType = "Machine"
User = "condor@_broker_"
UserLog = "c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log"
when_to_transfer_output = "ON_EXIT"



$ tail ShadowLog
06/09/11 05:45:48 Setting maximum accepts per cycle 4.
06/09/11 05:45:48 Initializing a VANILLA shadow for job 55148.0
06/09/11 05:45:48 (55148.0) (30261): WriteUserLog::initialize: safe_open_wrapper("c:\/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log") failed - errno 2 (No such file or directory)
06/09/11 05:45:48 (55148.0) (30261): WriteUserLog::initialize: failed to open file
06/09/11 05:45:48 (55148.0) (30261): Failed to initialize user log to c:\/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
06/09/11 05:45:48 (55148.0) (30261): Job 55148.0 going into Hold state (code 22,0): Failed to initialize user log to c:\/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
06/09/11 05:45:48 (55148.0) (30261): RemoteResource::killStarter(): DCStartd object NULL!
Stack dump for process 30261 at timestamp 1307612748 (13 frames)
condor_shadow(dprintf_dump_stack+0x44)[0x81322d4]
condor_shadow[0x815ae87]
[0xed6400]
condor_shadow(_ZN10BaseShadow16updateJobInQueueE8update_t+0xb1)[0x80bac81]
condor_shadow(_ZN10BaseShadow7holdJobEPKcii+0x131)[0x80bc651]
condor_shadow(_ZN10BaseShadow11initUserLogEv+0x113)[0x80bc863]
condor_shadow(_ZN10BaseShadow8baseInitEPN14compat_classad7ClassAdEPKcS4_+0x2cb)[0x80bd56b]
condor_shadow(_ZN9UniShadow4initEPN14compat_classad7ClassAdEPKcS4_+0x31)[0x80ad831]
condor_shadow(_Z10initShadowPN14compat_classad7ClassAdE+0xcd)[0x80b2c1d]
condor_shadow(_Z11startShadowPN14compat_classad7ClassAdE+0x62)[0x80b2d92]
condor_shadow(main+0x13a5)[0x80e3045]
/lib/libc.so.6(__libc_start_main+0xe6)[0x585cc6]
condor_shadow[0x80a20f1]

I've read https://bugzilla.redhat.com/show_bug.cgi?id=610265#c6, and according to the table there the submitted ClassAds are OK and the starter should execute the job in "EXECUTE/exec". All the paths are absolute, but the shadow doesn't recognize them as absolute, so it crashes:

$ gdb `which condor_shadow` core.30331
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/condor_shadow...Reading symbols from /usr/lib/debug/usr/sbin/condor_shadow.debug...done.
done.
[New Thread 30331]
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/23/e5a71140d5c8345a1c915447c466c23f43dc02
Reading symbols from /lib/libdl-2.12.so...Reading symbols from /usr/lib/debug/lib/libdl-2.12.so.debug...done.
done.
Loaded symbols for /lib/libdl-2.12.so
Reading symbols from /usr/lib/libclassad.so.1.1.0...Reading symbols from /usr/lib/debug/usr/lib/libclassad.so.1.1.0.debug...done.
done.
Loaded symbols for /usr/lib/libclassad.so.1.1.0
Reading symbols from /lib/libexpat.so.1.5.2...Reading symbols from /usr/lib/debug/lib/libexpat.so.1.5.2.debug...done.
done.
Loaded symbols for /lib/libexpat.so.1.5.2
Reading symbols from /lib/libpcre.so.0.0.1...Reading symbols from /usr/lib/debug/lib/libpcre.so.0.0.1.debug...done.
done.
Loaded symbols for /lib/libpcre.so.0.0.1
Reading symbols from /usr/lib/libssl.so.1.0.0...Reading symbols from /usr/lib/debug/usr/lib/libssl.so.1.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libssl.so.1.0.0
Reading symbols from /usr/lib/libcrypto.so.1.0.0...Reading symbols from /usr/lib/debug/usr/lib/libcrypto.so.1.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libcrypto.so.1.0.0
Reading symbols from /lib/libkrb5.so.3.3...Reading symbols from /usr/lib/debug/lib/libkrb5.so.3.3.debug...done.
done.
Loaded symbols for /lib/libkrb5.so.3.3
Reading symbols from /lib/libcom_err.so.2.1...Reading symbols from /usr/lib/debug/lib/libcom_err.so.2.1.debug...done.
done.
Loaded symbols for /lib/libcom_err.so.2.1
Reading symbols from /lib/libk5crypto.so.3.1...Reading symbols from /usr/lib/debug/lib/libk5crypto.so.3.1.debug...done.
done.
Loaded symbols for /lib/libk5crypto.so.3.1
Reading symbols from /lib/libkrb5support.so.0.1...Reading symbols from /usr/lib/debug/lib/libkrb5support.so.0.1.debug...done.
done.
Loaded symbols for /lib/libkrb5support.so.0.1
Reading symbols from /usr/lib/libstdc++.so.6.0.13...Reading symbols from /usr/lib/debug/usr/lib/libstdc++.so.6.0.13.debug...done.
done.
Loaded symbols for /usr/lib/libstdc++.so.6.0.13
Reading symbols from /lib/libm-2.12.so...Reading symbols from /usr/lib/debug/lib/libm-2.12.so.debug...done.
done.
Loaded symbols for /lib/libm-2.12.so
Reading symbols from /lib/libgcc_s-4.4.5-20110214.so.1...Reading symbols from /usr/lib/debug/lib/libgcc_s-4.4.5-20110214.so.1.debug...done.
done.
Loaded symbols for /lib/libgcc_s-4.4.5-20110214.so.1
Reading symbols from /lib/libpthread-2.12.so...Reading symbols from /usr/lib/debug/lib/libpthread-2.12.so.debug...done.
[Thread debugging using libthread_db enabled]
done.
Loaded symbols for /lib/libpthread-2.12.so
Reading symbols from /lib/libc-2.12.so...Reading symbols from /usr/lib/debug/lib/libc-2.12.so.debug...done.
done.
Loaded symbols for /lib/libc-2.12.so
Reading symbols from /lib/ld-2.12.so...Reading symbols from /usr/lib/debug/lib/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib/ld-2.12.so
Reading symbols from /lib/libgssapi_krb5.so.2.2...Reading symbols from /usr/lib/debug/lib/libgssapi_krb5.so.2.2.debug...done.
done.
Loaded symbols for /lib/libgssapi_krb5.so.2.2
Reading symbols from /lib/libresolv-2.12.so...Reading symbols from /usr/lib/debug/lib/libresolv-2.12.so.debug...done.
done.
Loaded symbols for /lib/libresolv-2.12.so
Reading symbols from /lib/libz.so.1.2.3...Reading symbols from /usr/lib/debug/lib/libz.so.1.2.3.debug...done.
done.
Loaded symbols for /lib/libz.so.1.2.3
Reading symbols from /lib/libkeyutils.so.1.3...Reading symbols from /usr/lib/debug/lib/libkeyutils.so.1.3.debug...done.
done.
Loaded symbols for /lib/libkeyutils.so.1.3
Reading symbols from /lib/libselinux.so.1...Reading symbols from /usr/lib/debug/lib/libselinux.so.1.debug...done.
done.
Loaded symbols for /lib/libselinux.so.1
Reading symbols from /lib/libnss_files-2.12.so...Reading symbols from /usr/lib/debug/lib/libnss_files-2.12.so.debug...done.
done.
Loaded symbols for /lib/libnss_files-2.12.so
Reading symbols from /lib/libnss_dns-2.12.so...Reading symbols from /usr/lib/debug/lib/libnss_dns-2.12.so.debug...done.
done.
Loaded symbols for /lib/libnss_dns-2.12.so
Core was generated by `condor_shadow -f 55160.0 --schedd=<_ip_:43223> --xfer-queue=limit=upload'.
Program terminated with signal 11, Segmentation fault.
#0  0x00a73424 in __kernel_vsyscall ()
(gdb) thread apply all bt

Thread 1 (Thread 0xb7820750 (LWP 30331)):
#0  0x00a73424 in __kernel_vsyscall ()
#1  0x007167e0 in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#2  0x0815aed4 in sig_backtrace_handler (signum=11) at /usr/src/debug/condor-7.6.0/src/condor_utils/dprintf_config.cpp:75
#3  <signal handler called>
#4  0x080bac81 in BaseShadow::updateJobInQueue (this=0x85ab258, type=U_HOLD) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/baseshadow.cpp:1177
#5  0x080bc651 in BaseShadow::holdJob (this=0x85ab258, reason=0x8596c98 "Failed to initialize user log to c:\\/c:\\users\\admini~1.mku\\appdata\\local\\temp\\mrg_1.1.log", hold_reason_code=22, 
    hold_reason_subcode=0) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/baseshadow.cpp:422
#6  0x080bc863 in BaseShadow::initUserLog (this=0x85ab258) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/baseshadow.cpp:826
#7  0x080bd56b in BaseShadow::baseInit (this=0x85ab258, job_ad=0x85a8f08, schedd_addr=0xbfcb5ccf "<ip:49370>", xfer_queue_contact_info=0xbfcb5ca0 "limit=upload,download;addr=<ip:49370>")
    at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/baseshadow.cpp:159
#8  0x080ad831 in UniShadow::init (this=0x85ab258, job_ad=0x85a8f08, schedd_addr=0xbfcb5ccf "<ip:49370>", xfer_queue_contact_info=0xbfcb5ca0 "limit=upload,download;addr=<ip:49370>")
    at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/shadow.cpp:102
#9  0x080b2c1d in initShadow (ad=0x85a8f08) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/shadow_v61_main.cpp:272
#10 0x080b2d92 in startShadow (ad=0x85a8f08) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/shadow_v61_main.cpp:292
#11 0x080e3045 in main (argc=6, argv=0xbfcb5ad8) at /usr/src/debug/condor-7.6.0/src/condor_daemon_core.V6/daemon_core_main.cpp:2374
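For illustration only: the doubled path in the ShadowLog above ("c:\/c:\users\...") is consistent with the IWD being prepended to a log path that a POSIX-style absoluteness check does not recognize as absolute. The sketch below reproduces that string using Python's standard posixpath and ntpath modules. The helper resolve_log_path is hypothetical and is not Condor code; it only assumes that the shadow joins IWD and UserLog with "/" when it believes the log path is relative.

```python
import ntpath
import posixpath


def resolve_log_path(iwd, log, is_abs):
    """Hypothetical helper: prepend iwd unless is_abs(log) reports an absolute path."""
    if is_abs(log):
        return log
    return iwd + "/" + log


# Values taken from the job ad above (iwd and UserLog).
iwd = "c:\\"
log = r"c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log"

# A POSIX-style check only looks for a leading "/", so the Windows path
# is treated as relative and the IWD is prepended.
print(resolve_log_path(iwd, log, posixpath.isabs))
# c:\/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log

# A Windows-aware check recognizes the drive letter plus backslash,
# short-named component included, and leaves the path unchanged.
print(resolve_log_path(iwd, log, ntpath.isabs))
# c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
```

The first result matches the path reported by WriteUserLog::initialize in the ShadowLog, which suggests, without proving, that the path was judged by POSIX rules on the Linux schedd host.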


Version-Release number of selected component (if applicable):
condor-7.6.1-0.10
condor-win-7.6.1-0.11

How reproducible:
100%

Steps to Reproduce:
1. Set up a pool: Linux1 (CM, Sched, Exec); Linux2 (Sched, Exec); Windows (Exec).
2. Disable authentication with claimtobe.
3. Add all users who will submit from Windows to the Linux machine.
4. Submit a Windows job from Windows.
  
Actual results:
condor_shadow doesn't recognize the absolute path.

Expected results:
Condor recognizes the absolute path and there are NO core files in the $(LOG) directory.

Comment 1 Timothy St. Clair 2011-06-09 13:29:27 UTC
I think this is because you are using Windows short names ("~"). I use full absolute paths all the time.

Either way we should probably support windows short named paths.

Comment 3 Timothy St. Clair 2011-07-08 18:43:34 UTC
Short names should be avoided when possible because of conflicts with the CLASSAD language.  The correct method would be to quote, but even then, it's likely not the best solution.  

I think the best method will be to throw an error during submit that disallows short names and forces the user to specify the full path.
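A rough sketch of what such a submit-time check might look like, written in Python for illustration. This is not condor_submit code: the function name and the regular expression are assumptions, and the pattern is only a heuristic, since a long file name can legitimately contain "~1" as well.

```python
import re

# Heuristic for an 8.3 short-name component: up to six base characters,
# "~", one or more digits, and an optional extension of up to three characters.
SHORT_COMPONENT = re.compile(r"[^\\/:~.]{1,6}~[0-9]+(\.[^\\/:.]{0,3})?")


def short_name_components(path):
    """Return the path components that look like Windows 8.3 short names."""
    parts = re.split(r"[\\/]", path)
    return [p for p in parts if SHORT_COMPONENT.fullmatch(p)]


for candidate in (
    r"c:\users\admini~1.mku\appdata\local\temp\mrg.log",
    r"C:\mrg\ver.bat",
):
    hits = short_name_components(candidate)
    if hits:
        print("ERROR: short-named components not allowed:", ", ".join(hits))
    else:
        print("OK:", candidate)
```

A check like this would reject the log, output, and error paths from the original submit file while accepting the executable path.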

Comment 4 Timothy St. Clair 2011-07-08 20:27:39 UTC
<retract last comment>

Cannot reproduce with the latest build (condor-7.6.3-0.1) using:

Error=C:\condor\tests\UUUUUU~1.FOO\mrg_$(Cluster).$(Process).err
Output=C:\condor\tests\UUUUUU~1.FOO\mrg_$(Cluster).$(Process).out
Log=C:\condor\tests\UUUUUU~1.FOO\mrg_$(Cluster).$(Process).log

condor_submit -remote my_schedd my.sub

---------------------------------------------------------------
Could you please provide repro info with the latest build?