Bug 604986 - Condor shadow crash: # of command handlers exceeded specified maximum
Summary: Condor shadow crash: # of command handlers exceeded specified maximum
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All
OS: Linux
high
high
Target Milestone: 1.3
: ---
Assignee: Matthew Farrellee
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-06-17 08:20 UTC by Martin Kudlej
Modified: 2010-10-20 11:29 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-20 11:29:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
log files and condor_config.local (530.38 KB, application/x-gzip)
2010-06-17 08:20 UTC, Martin Kudlej
no flags Details

Description Martin Kudlej 2010-06-17 08:20:39 UTC
Created attachment 424718 [details]
log files and condor_config.local

Version-Release number of selected component (if applicable):
condor-7.4.3-0.17.el5

How reproducible:
7 crashes in 2 hours

Steps to Reproduce:
1. set up Condor CM+Scheduler on RHEL5 and Windows execute nodes
2. submit jobs
3. wait for crash
  
Actual results:
There are 7 crashes in 2 hours.

Expected results:
There will be no crashes.

Additional info:
06/17 04:11:27 (32.250) (6961): ERROR "# of command handlers exceeded specified maximum" at line 893 in file daemon_core.cpp
Stack dump for process 6961 at timestamp 1276762287 (23 frames)
condor_shadow(dprintf_dump_stack+0x44)[0x80e81d4]
condor_shadow[0x80e9f34]
[0x7dd420]
/lib/libc.so.6(abort+0x101)[0x82b701]
condor_shadow(_EXCEPT_+0x93)[0x80e80a3]
condor_shadow(_ZN10DaemonCore16Register_CommandEiPKcPFiP7ServiceiP6StreamEMS2_FiiS5_ES1_S3_12DCpermissionii+0x21e)[0x80cc4ee]
condor_shadow(_ZN10DaemonCore16Register_CommandEiPKcM7ServiceFiiP6StreamES1_PS2_12DCpermissioni+0x33)[0x80cc5d3]
condor_shadow(_ZN9UniShadow4initEP7ClassAdPKcS3_+0x77)[0x80b4747]
condor_shadow(_Z10initShadowP7ClassAd+0xd6)[0x80b04e6]
condor_shadow(_Z11startShadowP7ClassAd+0x69)[0x80b0609]
condor_shadow(_Z13recycleShadowi+0x11d)[0x80b07ad]
condor_shadow(_ZN10BaseShadow12terminateJobE14update_style_t+0x16c)[0x80b234c]
condor_shadow(_ZN16ShadowUserPolicy8doActionEib+0x188)[0x80c4088]
condor_shadow(_ZN14BaseUserPolicy11checkAtExitEv+0x3c)[0x814356c]
condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x69)[0x80b9a99]
condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x834)[0x80d7624]
condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x22)[0x80d7902]
condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40)[0x8151800]
condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x130)[0x80cf6e0]
condor_shadow(_ZN10DaemonCore6DriverEv+0x1f66)[0x80d1dd6]
condor_shadow(main+0xd80)[0x80e3b30]
/lib/libc.so.6(__libc_start_main+0xdc)[0x816e9c]
condor_shadow[0x80b0091]

Comment 1 Martin Kudlej 2010-06-17 13:06:55 UTC
I've submitted those jobs by:
for i in `seq 100`; do su xxx -c 'condor_submit /root/wait.bat.sub' || service condor stop || killall condor_schedd;sleep 30;done

cat /root/wait.bat.sub:
universe = vanilla
executable = /root/wait.bat
arguments = 1
requirements = ( Arch=="Intel") && ( OpSys=="WINNT51" || OpSys=="WINNT52" )
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
iwd = /tmp

queue 10000

Comment 2 Matthew Farrellee 2010-06-17 18:31:23 UTC
c++filt,

06/17 04:11:27 (32.250) (6961): ERROR "# of command handlers exceeded specified
maximum" at line 893 in file daemon_core.cpp
Stack dump for process 6961 at timestamp 1276762287 (23 frames)
condor_shadow(dprintf_dump_stack+0x44)[0x80e81d4]
condor_shadow[0x80e9f34]
[0x7dd420]
/lib/libc.so.6(abort+0x101)[0x82b701]
condor_shadow(_EXCEPT_+0x93)[0x80e80a3]
condor_shadow(DaemonCore::Register_Command(int, char const*, int (*)(Service*, int, Stream*), int (Service::*)(int, Stream*), char const*, Service*, DCpermission, int, int)+0x21e)[0x80cc4ee]
condor_shadow(DaemonCore::Register_Command(int, char const*, int (Service::*)(int, Stream*), char const*, Service*, DCpermission, int)+0x33)[0x80cc5d3]
condor_shadow(UniShadow::init(ClassAd*, char const*, char const*)+0x77)[0x80b4747]
condor_shadow(initShadow(ClassAd*)+0xd6)[0x80b04e6]
condor_shadow(startShadow(ClassAd*)+0x69)[0x80b0609]
condor_shadow(recycleShadow(int)+0x11d)[0x80b07ad]
condor_shadow(BaseShadow::terminateJob(update_style_t)+0x16c)[0x80b234c]
condor_shadow(ShadowUserPolicy::doAction(int, bool)+0x188)[0x80c4088]
condor_shadow(BaseUserPolicy::checkAtExit()+0x3c)[0x814356c]
condor_shadow(RemoteResource::handleSysCalls(Stream*)+0x69)[0x80b9a99]
condor_shadow(DaemonCore::CallSocketHandler_worker(int, bool, Stream*)+0x834)[0x80d7624]
condor_shadow(DaemonCore::CallSocketHandler_worker_demarshall(void*)+0x22)[0x80d7902]
condor_shadow(CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)+0x40)[0x8151800]
condor_shadow(DaemonCore::CallSocketHandler(int&, bool)+0x130)[0x80cf6e0]
condor_shadow(DaemonCore::Driver()+0x1f66)[0x80d1dd6]
condor_shadow(main+0xd80)[0x80e3b30]
/lib/libc.so.6(__libc_start_main+0xdc)[0x816e9c]
condor_shadow[0x80b0091]

Comment 3 Matthew Farrellee 2010-06-17 18:34:01 UTC
This is reproducible with simple sleep jobs, e.g.

echo "cmd=/bin/sleep\nargs=0\nnotification=never\nqueue 500" | condor_submit

And recycleShadow is likely important.

Comment 4 Matthew Farrellee 2010-06-17 18:37:39 UTC
Notice shadow reuse - pid does not change until crash

$ condor_q | tail -n1 ; ps auxwww | grep condor_shadow
154 jobs; 153 idle, 1 running, 0 held
matt     17162  6.6  0.0  50200  2764 ?        S    14:27   0:00 condor_shadow -f 6339.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -

$ condor_q | tail -n1 ; ps auxwww | grep condor_shadow
89 jobs; 88 idle, 1 running, 0 held
matt     17162  6.6  0.0  50200  2768 ?        S    14:27   0:00 condor_shadow -f 6339.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -

$ condor_q | tail -n1 ; ps auxwww | grep condor_shadow
13 jobs; 12 idle, 1 running, 0 held
matt     17900  0.0  0.0  50200  2756 ?        S    14:27   0:00 condor_shadow -f 6339.239 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -

$ condor_q | tail -n1 ; ps auxwww | grep condor_shadow
0 jobs; 0 idle, 0 running, 0 held

$ grep ERROR /var/log/condor/ShadowLog
06/17 14:27:16 (6339.239) (17162): ERROR "# of command handlers exceeded specified maximum" at line 893 in file daemon_core.cpp

Comment 5 Matthew Farrellee 2010-06-17 18:38:41 UTC
Configuring SHADOW_WORKLIFE=0 allows for successful execution of a 500 job submission without any ERRORs.

Comment 7 Matthew Farrellee 2010-06-17 19:05:21 UTC
Success,

$ while [ 1 ] ; do condor_q | tail -n1 ; ps auxwww | grep condor_shadow | grep schedd; sleep 3; done
1500 jobs; 1500 idle, 0 running, 0 held
1500 jobs; 1500 idle, 0 running, 0 held
1500 jobs; 1500 idle, 0 running, 0 held
1500 jobs; 1500 idle, 0 running, 0 held
1500 jobs; 1500 idle, 0 running, 0 held
1489 jobs; 1488 idle, 1 running, 0 held
matt      3660  5.0  0.0  49996  3348 ?        S    15:02   0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1442 jobs; 1441 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3352 ?        S    15:02   0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1395 jobs; 1394 idle, 1 running, 0 held
matt      3660  6.1  0.0  49996  3356 ?        S    15:02   0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1348 jobs; 1347 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3360 ?        R    15:02   0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1301 jobs; 1300 idle, 1 running, 0 held
matt      3660  5.9  0.0  49996  3364 ?        S    15:02   0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1254 jobs; 1253 idle, 1 running, 0 held
matt      3660  6.0  0.0  49996  3372 ?        S    15:02   0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1207 jobs; 1206 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3372 ?        S    15:02   0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1160 jobs; 1159 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3372 ?        S    15:02   0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1113 jobs; 1112 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3376 ?        S    15:02   0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1067 jobs; 1066 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3384 ?        S    15:02   0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
1021 jobs; 1020 idle, 1 running, 0 held
matt      3660  6.1  0.0  49996  3388 ?        S    15:02   0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
976 jobs; 975 idle, 1 running, 0 held
matt      3660  6.1  0.0  49996  3392 ?        S    15:02   0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
930 jobs; 929 idle, 1 running, 0 held
matt      3660  6.1  0.0  49996  3396 ?        S    15:02   0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
884 jobs; 883 idle, 1 running, 0 held
matt      3660  6.1  0.0  49996  3400 ?        S    15:02   0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
838 jobs; 837 idle, 1 running, 0 held
matt      3660  6.1  0.0  49996  3404 ?        S    15:02   0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
792 jobs; 791 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3408 ?        S    15:02   0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
746 jobs; 745 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3412 ?        S    15:02   0:03 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
700 jobs; 699 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3420 ?        S    15:02   0:03 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
653 jobs; 652 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3424 ?        S    15:02   0:03 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
605 jobs; 604 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3428 ?        S    15:02   0:03 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
558 jobs; 557 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3432 ?        S    15:02   0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
509 jobs; 508 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3436 ?        S    15:02   0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
460 jobs; 459 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3440 ?        S    15:02   0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
413 jobs; 412 idle, 1 running, 0 held
matt      3660  6.2  0.0  49996  3448 ?        S    15:02   0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
366 jobs; 365 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3452 ?        S    15:02   0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
319 jobs; 318 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3456 ?        S    15:02   0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
272 jobs; 271 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3460 ?        S    15:02   0:05 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
225 jobs; 224 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3464 ?        S    15:02   0:05 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
177 jobs; 176 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3468 ?        S    15:02   0:05 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
129 jobs; 128 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3476 ?        S    15:02   0:05 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
81 jobs; 80 idle, 1 running, 0 held
matt      3660  6.3  0.0  49996  3480 ?        S    15:02   0:06 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
34 jobs; 33 idle, 1 running, 0 held
matt      3660  6.4  0.0  49996  3484 ?        S    15:02   0:06 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> -
0 jobs; 0 idle, 0 running, 0 held
0 jobs; 0 idle, 0 running, 0 held
0 jobs; 0 idle, 0 running, 0 held
0 jobs; 0 idle, 0 running, 0 held
0 jobs; 0 idle, 0 running, 0 held
0 jobs; 0 idle, 0 running, 0 held

$ grep ERROR /var/log/condor/ShadowLog*

Comment 8 Matthew Farrellee 2010-06-17 19:14:45 UTC
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1466

Built post 7.4.3-0.20

Comment 9 Martin Kudlej 2010-08-02 14:28:32 UTC
I've tested this on RHEL 5.5/4.8 x i386/x86_64 with 5000 jobs according comment #3 and it works without any exception in ShadowLog. --> VERIFIED


Note You need to log in before you can comment on or make changes to this bug.