Bug 712024 - condor_submit and shadow don't detect "short-named" absolute Windows path
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.1
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks: 692416
Reported: 2011-06-09 10:04 UTC by Martin Kudlej
Modified: 2011-08-31 12:26 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-01 14:07:41 UTC
Target Upstream Version:



Description Martin Kudlej 2011-06-09 10:04:43 UTC
Description of problem:
submit file:
should_transfer_files=YES
executable='C:\mrg\ver.bat'
iwd='C:\users\admini~1.mku\appdata\local\temp'
log='c:\users\admini~1.mku\appdata\local\temp\mrg.log'
output='c:\users\admini~1.mku\appdata\local\temp\mrg.out'
universe=vanilla
arguments=1
error='c:\users\admini~1.mku\appdata\local\temp\mrg.err'
when_to_transfer_output=ON_EXIT
queue

Error message:
c:\mrg>C:\condor\bin\condor_submit.exe -name _scheduler_ _file_
Submitting job(s)
ERROR: No such directory: C:\mrg\'c:\users\admini~1.mku\appdata\local\temp\'
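
The error above is consistent with the quotes being taken as part of the value: the string then no longer starts with a drive letter, so it is treated as relative to the current directory. A minimal Python sketch of that effect, using the standard `ntpath` module as a stand-in for Windows path rules (an illustration only, not condor_submit's actual code):

```python
import ntpath  # Windows path semantics, usable on any platform

cwd = r"C:\mrg"
quoted_iwd = r"'c:\users\admini~1.mku\appdata\local\temp'"
plain_iwd = quoted_iwd.strip("'\"")  # same value with the quotes removed

# With the quotes kept, the value does not start with a drive letter,
# so it is not absolute and is appended to the current directory.
quoted_is_abs = ntpath.isabs(quoted_iwd)
joined_quoted = ntpath.join(cwd, quoted_iwd)

# Without the quotes, the drive-qualified path replaces the current directory.
plain_is_abs = ntpath.isabs(plain_iwd)
joined_plain = ntpath.join(cwd, plain_iwd)

print(quoted_is_abs, joined_quoted)
print(plain_is_abs, joined_plain)
```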

If I remove the single and double quotes, condor_submit submits the file without error, but then I see a segmentation fault in the shadow:
06/09/11 04:04:09 Initializing a VANILLA shadow for job 359776.0
06/09/11 04:04:09 (359776.0) (18936): WriteUserLog::initialize: safe_open_wrapper("c:\users\admini~1.mku\appdata\local\temp/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log") failed - errno 2 (No such file or directory)
06/09/11 04:04:09 (359776.0) (18936): WriteUserLog::initialize: failed to open file
06/09/11 04:04:09 (359776.0) (18936): Failed to initialize user log to c:\users\admini~1.mku\appdata\local\temp/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
06/09/11 04:04:09 (359776.0) (18936): Job 359776.0 going into Hold state (code 22,0): Failed to initialize user log to c:\users\admini~1.mku\appdata\local\temp/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
06/09/11 04:04:09 (359776.0) (18936): RemoteResource::killStarter(): DCStartd object NULL!
Stack dump for process 18936 at timestamp 1307606649 (13 frames)
condor_shadow(dprintf_dump_stack+0x4a)[0x817b1da]
condor_shadow[0x814e4f6]
[0xe7e420]
condor_shadow(_ZN10BaseShadow16updateJobInQueueE8update_t+0xb2)[0x80b0442]
condor_shadow(_ZN10BaseShadow7holdJobEPKcii+0x12b)[0x80b22cb]
condor_shadow(_ZN10BaseShadow11initUserLogEv+0x131)[0x80b2561]
condor_shadow(_ZN10BaseShadow8baseInitEPN14compat_classad7ClassAdEPKcS4_+0x2f9)[0x80b34c9]
condor_shadow(_ZN9UniShadow4initEPN14compat_classad7ClassAdEPKcS4_+0x41)[0x80a6711]
condor_shadow(_Z10initShadowPN14compat_classad7ClassAdE+0xd0)[0x80aab60]
condor_shadow(_Z11startShadowPN14compat_classad7ClassAdE+0x68)[0x80aacb8]
condor_shadow(main+0x1162)[0x80eae12]
/lib/libc.so.6(__libc_start_main+0xdc)[0x5f1e9c]
condor_shadow[0x80a5f31]
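
The doubled path in the log above is what one would get if the shadow, running on Linux, applied POSIX rules to the log path: a value starting with `c:\` has no leading `/`, so it counts as relative and is appended to the iwd. A minimal Python sketch under that assumption (the helpers `is_abs_any_os` and `resolve` are hypothetical and are not Condor's API):

```python
import ntpath
import posixpath

iwd = r"c:\users\admini~1.mku\appdata\local\temp"
user_log = r"c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log"

# POSIX rules: no leading "/", so the log path is treated as relative to iwd.
posix_result = posixpath.join(iwd, user_log)

def is_abs_any_os(path: str) -> bool:
    """Hypothetical check: absolute under either POSIX or Windows rules."""
    return posixpath.isabs(path) or ntpath.isabs(path)

def resolve(base: str, path: str) -> str:
    """Prepend base only when path is relative under both conventions."""
    return path if is_abs_any_os(path) else posixpath.join(base, path)

print(posix_result)            # doubled path, as in the ShadowLog
print(resolve(iwd, user_log))  # log path left untouched
```

Checking both conventions matters in a mixed pool, where a Linux scheduler handles job ads written on Windows.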

The situation is similar for comparable QMF jobs (submitted from Windows):
$ condor_history -back -l 55159 | sort

args = "1"
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,LastPeriodicCheckpoint,RequestCpus,RequestDisk,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 15
ClusterId = 55159
cmd = "ver.bat"
CumulativeSlotTime = 7.000000
CurrentHosts = 0
CurrentTime = time()
EnteredCurrentStatus = 1307612747
Err = "c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.err"
GlobalJobId = "_scheduler_#55159.0#1307612629"
ImageSize = 0
ImageSize_RAW = 0
iwd = "c:\"
JobCurrentStartDate = 1307612746
JobFinishedHookDone = 1307612748
JobLastStartDate = 1307612745
JobPrio = 0
JobRunCount = 69
JobStartDate = 1307612637
JobStatus = 3
JobUniverse = 5
LastJobLeaseRenewal = 1307612746
LastJobStatus = 1
LastMatchTime = 1307612746
LastPublicClaimId = "<_ip_:1068>#1307501598#424#..."
LastRemoteHost = "slot2@mkudlej_windows2003_64"
LastSuspensionTime = 0
MachineAttrCpus0 = 1
MachineAttrSlotWeight0 = 1
MaxHosts = 1
MinHosts = 1
MyType = "Job"
NumJobMatches = 69
NumShadowStarts = 69
OrigMaxHosts = 1
Out = "c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.out"
owner = "condor"
ProcId = 0
QDate = 1307612629
RemoteUserCpu = 0.0
RemoteWallClockTime = 7.000000
RemoveReason = "via condor_rm (by user condor)"
requirements = ( FileSystemDomain =!= undefined ) && ( Arch =!= undefined ) && ( OpSys == "WINNT51" || OpSys == "WINNT52" || OpSys == "WINNT60" || OpSys == "WINNT61" )
ShouldTransferFiles = "NO"
should_transfer_files = "YES"
StartdPrincipal = "unauthenticated@unmapped/_ip_"
Submission = "_scheduler_#55159"
TargetType = "Machine"
User = "condor@_broker_"
UserLog = "c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log"
when_to_transfer_output = "ON_EXIT"



$ tail ShadowLog
06/09/11 05:45:48 Setting maximum accepts per cycle 4.
06/09/11 05:45:48 Initializing a VANILLA shadow for job 55148.0
06/09/11 05:45:48 (55148.0) (30261): WriteUserLog::initialize: safe_open_wrapper("c:\/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log") failed - errno 2 (No such file or directory)
06/09/11 05:45:48 (55148.0) (30261): WriteUserLog::initialize: failed to open file
06/09/11 05:45:48 (55148.0) (30261): Failed to initialize user log to c:\/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
06/09/11 05:45:48 (55148.0) (30261): Job 55148.0 going into Hold state (code 22,0): Failed to initialize user log to c:\/c:\users\admini~1.mku\appdata\local\temp\mrg_1.1.log
06/09/11 05:45:48 (55148.0) (30261): RemoteResource::killStarter(): DCStartd object NULL!
Stack dump for process 30261 at timestamp 1307612748 (13 frames)
condor_shadow(dprintf_dump_stack+0x44)[0x81322d4]
condor_shadow[0x815ae87]
[0xed6400]
condor_shadow(_ZN10BaseShadow16updateJobInQueueE8update_t+0xb1)[0x80bac81]
condor_shadow(_ZN10BaseShadow7holdJobEPKcii+0x131)[0x80bc651]
condor_shadow(_ZN10BaseShadow11initUserLogEv+0x113)[0x80bc863]
condor_shadow(_ZN10BaseShadow8baseInitEPN14compat_classad7ClassAdEPKcS4_+0x2cb)[0x80bd56b]
condor_shadow(_ZN9UniShadow4initEPN14compat_classad7ClassAdEPKcS4_+0x31)[0x80ad831]
condor_shadow(_Z10initShadowPN14compat_classad7ClassAdE+0xcd)[0x80b2c1d]
condor_shadow(_Z11startShadowPN14compat_classad7ClassAdE+0x62)[0x80b2d92]
condor_shadow(main+0x13a5)[0x80e3045]
/lib/libc.so.6(__libc_start_main+0xe6)[0x585cc6]
condor_shadow[0x80a20f1]

I've read https://bugzilla.redhat.com/show_bug.cgi?id=610265#c6, and according to the table there the submitted ClassAds are OK and the starter should execute the job in "EXECUTE/exec". All paths are absolute, but the shadow doesn't recognize them as absolute, so it crashes:

$ gdb `which condor_shadow` core.30331
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/condor_shadow...Reading symbols from /usr/lib/debug/usr/sbin/condor_shadow.debug...done.
done.
[New Thread 30331]
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/23/e5a71140d5c8345a1c915447c466c23f43dc02
Reading symbols from /lib/libdl-2.12.so...Reading symbols from /usr/lib/debug/lib/libdl-2.12.so.debug...done.
done.
Loaded symbols for /lib/libdl-2.12.so
Reading symbols from /usr/lib/libclassad.so.1.1.0...Reading symbols from /usr/lib/debug/usr/lib/libclassad.so.1.1.0.debug...done.
done.
Loaded symbols for /usr/lib/libclassad.so.1.1.0
Reading symbols from /lib/libexpat.so.1.5.2...Reading symbols from /usr/lib/debug/lib/libexpat.so.1.5.2.debug...done.
done.
Loaded symbols for /lib/libexpat.so.1.5.2
Reading symbols from /lib/libpcre.so.0.0.1...Reading symbols from /usr/lib/debug/lib/libpcre.so.0.0.1.debug...done.
done.
Loaded symbols for /lib/libpcre.so.0.0.1
Reading symbols from /usr/lib/libssl.so.1.0.0...Reading symbols from /usr/lib/debug/usr/lib/libssl.so.1.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libssl.so.1.0.0
Reading symbols from /usr/lib/libcrypto.so.1.0.0...Reading symbols from /usr/lib/debug/usr/lib/libcrypto.so.1.0.0.debug...done.
done.
Loaded symbols for /usr/lib/libcrypto.so.1.0.0
Reading symbols from /lib/libkrb5.so.3.3...Reading symbols from /usr/lib/debug/lib/libkrb5.so.3.3.debug...done.
done.
Loaded symbols for /lib/libkrb5.so.3.3
Reading symbols from /lib/libcom_err.so.2.1...Reading symbols from /usr/lib/debug/lib/libcom_err.so.2.1.debug...done.
done.
Loaded symbols for /lib/libcom_err.so.2.1
Reading symbols from /lib/libk5crypto.so.3.1...Reading symbols from /usr/lib/debug/lib/libk5crypto.so.3.1.debug...done.
done.
Loaded symbols for /lib/libk5crypto.so.3.1
Reading symbols from /lib/libkrb5support.so.0.1...Reading symbols from /usr/lib/debug/lib/libkrb5support.so.0.1.debug...done.
done.
Loaded symbols for /lib/libkrb5support.so.0.1
Reading symbols from /usr/lib/libstdc++.so.6.0.13...Reading symbols from /usr/lib/debug/usr/lib/libstdc++.so.6.0.13.debug...done.
done.
Loaded symbols for /usr/lib/libstdc++.so.6.0.13
Reading symbols from /lib/libm-2.12.so...Reading symbols from /usr/lib/debug/lib/libm-2.12.so.debug...done.
done.
Loaded symbols for /lib/libm-2.12.so
Reading symbols from /lib/libgcc_s-4.4.5-20110214.so.1...Reading symbols from /usr/lib/debug/lib/libgcc_s-4.4.5-20110214.so.1.debug...done.
done.
Loaded symbols for /lib/libgcc_s-4.4.5-20110214.so.1
Reading symbols from /lib/libpthread-2.12.so...Reading symbols from /usr/lib/debug/lib/libpthread-2.12.so.debug...done.
[Thread debugging using libthread_db enabled]
done.
Loaded symbols for /lib/libpthread-2.12.so
Reading symbols from /lib/libc-2.12.so...Reading symbols from /usr/lib/debug/lib/libc-2.12.so.debug...done.
done.
Loaded symbols for /lib/libc-2.12.so
Reading symbols from /lib/ld-2.12.so...Reading symbols from /usr/lib/debug/lib/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib/ld-2.12.so
Reading symbols from /lib/libgssapi_krb5.so.2.2...Reading symbols from /usr/lib/debug/lib/libgssapi_krb5.so.2.2.debug...done.
done.
Loaded symbols for /lib/libgssapi_krb5.so.2.2
Reading symbols from /lib/libresolv-2.12.so...Reading symbols from /usr/lib/debug/lib/libresolv-2.12.so.debug...done.
done.
Loaded symbols for /lib/libresolv-2.12.so
Reading symbols from /lib/libz.so.1.2.3...Reading symbols from /usr/lib/debug/lib/libz.so.1.2.3.debug...done.
done.
Loaded symbols for /lib/libz.so.1.2.3
Reading symbols from /lib/libkeyutils.so.1.3...Reading symbols from /usr/lib/debug/lib/libkeyutils.so.1.3.debug...done.
done.
Loaded symbols for /lib/libkeyutils.so.1.3
Reading symbols from /lib/libselinux.so.1...Reading symbols from /usr/lib/debug/lib/libselinux.so.1.debug...done.
done.
Loaded symbols for /lib/libselinux.so.1
Reading symbols from /lib/libnss_files-2.12.so...Reading symbols from /usr/lib/debug/lib/libnss_files-2.12.so.debug...done.
done.
Loaded symbols for /lib/libnss_files-2.12.so
Reading symbols from /lib/libnss_dns-2.12.so...Reading symbols from /usr/lib/debug/lib/libnss_dns-2.12.so.debug...done.
done.
Loaded symbols for /lib/libnss_dns-2.12.so
Core was generated by `condor_shadow -f 55160.0 --schedd=<_ip_:43223> --xfer-queue=limit=upload'.
Program terminated with signal 11, Segmentation fault.
#0  0x00a73424 in __kernel_vsyscall ()
(gdb) thread apply all bt

Thread 1 (Thread 0xb7820750 (LWP 30331)):
#0  0x00a73424 in __kernel_vsyscall ()
#1  0x007167e0 in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#2  0x0815aed4 in sig_backtrace_handler (signum=11) at /usr/src/debug/condor-7.6.0/src/condor_utils/dprintf_config.cpp:75
#3  <signal handler called>
#4  0x080bac81 in BaseShadow::updateJobInQueue (this=0x85ab258, type=U_HOLD) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/baseshadow.cpp:1177
#5  0x080bc651 in BaseShadow::holdJob (this=0x85ab258, reason=0x8596c98 "Failed to initialize user log to c:\\/c:\\users\\admini~1.mku\\appdata\\local\\temp\\mrg_1.1.log", hold_reason_code=22, 
    hold_reason_subcode=0) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/baseshadow.cpp:422
#6  0x080bc863 in BaseShadow::initUserLog (this=0x85ab258) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/baseshadow.cpp:826
#7  0x080bd56b in BaseShadow::baseInit (this=0x85ab258, job_ad=0x85a8f08, schedd_addr=0xbfcb5ccf "<ip:49370>", xfer_queue_contact_info=0xbfcb5ca0 "limit=upload,download;addr=<ip:49370>")
    at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/baseshadow.cpp:159
#8  0x080ad831 in UniShadow::init (this=0x85ab258, job_ad=0x85a8f08, schedd_addr=0xbfcb5ccf "<ip:49370>", xfer_queue_contact_info=0xbfcb5ca0 "limit=upload,download;addr=<ip:49370>")
    at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/shadow.cpp:102
#9  0x080b2c1d in initShadow (ad=0x85a8f08) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/shadow_v61_main.cpp:272
#10 0x080b2d92 in startShadow (ad=0x85a8f08) at /usr/src/debug/condor-7.6.0/src/condor_shadow.V6.1/shadow_v61_main.cpp:292
#11 0x080e3045 in main (argc=6, argv=0xbfcb5ad8) at /usr/src/debug/condor-7.6.0/src/condor_daemon_core.V6/daemon_core_main.cpp:2374


Version-Release number of selected component (if applicable):
condor-7.6.1-0.10
condor-win-7.6.1-0.11

How reproducible:
100%

Steps to Reproduce:
1. Set up a pool: Linux1 - CM, Sched, Exec; Linux2 - Sched, Exec; Windows - Exec
2. Disable authentication with claimtobe
3. Add all users who will submit from Windows to the Linux machine
4. Submit a Windows job from Windows
  
Actual results:
condor_shadow doesn't recognize the absolute path.

Expected results:
Condor recognizes the absolute path and there are no core files in the $(LOG) directory.

Comment 1 Timothy St. Clair 2011-06-09 13:29:27 UTC
I think this is because you are using Windows short names ("~"). I use full absolute paths all the time.

Either way, we should probably support Windows short-named paths.

Comment 3 Timothy St. Clair 2011-07-08 18:43:34 UTC
Short names should be avoided when possible because of conflicts with the ClassAd language. The correct method would be to quote them, but even then it's likely not the best solution.

I think the best method will be to throw an error during submit that disallows short names and forces the user to specify the full path.
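
A submit-time check of the kind proposed here could look for path components shaped like 8.3 short names. A rough Python sketch; the function name and the pattern are assumptions for illustration, and the pattern is only a heuristic, since a long file name may legitimately contain `~1`:

```python
import re

# Heuristic for an 8.3 short name: a base of at most 8 characters ending in
# "~<digits>", optionally followed by an extension of up to 3 characters.
SHORT_NAME = re.compile(r"^[^.~]{1,6}~\d{1,6}(\.[^.]{1,3})?$")

def short_name_components(path: str) -> list[str]:
    """Return the components of path that look like 8.3 short names."""
    parts = re.split(r"[\\/]", path)
    return [
        p for p in parts
        if SHORT_NAME.match(p) and len(p.split(".")[0]) <= 8
    ]

for candidate in (
    r"c:\users\admini~1.mku\appdata\local\temp\mrg.log",
    r"C:\Users\Administrator\temp\mrg.log",
):
    hits = short_name_components(candidate)
    print(candidate, "->", hits if hits else "no short names found")
```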

Comment 4 Timothy St. Clair 2011-07-08 20:27:39 UTC
<retract last comment>

Cannot reproduce with the latest build (condor-7.6.3-0.1) using:

Error=C:\condor\tests\UUUUUU~1.FOO\mrg_$(Cluster).$(Process).err
Output=C:\condor\tests\UUUUUU~1.FOO\mrg_$(Cluster).$(Process).out
Log=C:\condor\tests\UUUUUU~1.FOO\mrg_$(Cluster).$(Process).log

condor_submit -remote my_schedd my.sub

---------------------------------------------------------------
Could you please provide repro info with the latest build?

