Bug 604986
Summary: | Condor shadow crash: # of command handlers exceeded specified maximum | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Martin Kudlej <mkudlej> | ||||
Component: | condor | Assignee: | Matthew Farrellee <matt> | ||||
Status: | CLOSED ERRATA | QA Contact: | Martin Kudlej <mkudlej> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | Development | CC: | matt | ||||
Target Milestone: | 1.3 | ||||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2010-10-20 11:29:38 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
I've submitted those jobs by: for i in `seq 100`; do su xxx -c 'condor_submit /root/wait.bat.sub' || service condor stop || killall condor_schedd;sleep 30;done cat /root/wait.bat.sub: universe = vanilla executable = /root/wait.bat arguments = 1 requirements = ( Arch=="Intel") && ( OpSys=="WINNT51" || OpSys=="WINNT52" ) should_transfer_files = YES when_to_transfer_output = ON_EXIT iwd = /tmp queue 10000 c++filt, 06/17 04:11:27 (32.250) (6961): ERROR "# of command handlers exceeded specified maximum" at line 893 in file daemon_core.cpp Stack dump for process 6961 at timestamp 1276762287 (23 frames) condor_shadow(dprintf_dump_stack+0x44)[0x80e81d4] condor_shadow[0x80e9f34] [0x7dd420] /lib/libc.so.6(abort+0x101)[0x82b701] condor_shadow(_EXCEPT_+0x93)[0x80e80a3] condor_shadow(DaemonCore::Register_Command(int, char const*, int (*)(Service*, int, Stream*), int (Service::*)(int, Stream*), char const*, Service*, DCpermission, int, int)+0x21e)[0x80cc4ee] condor_shadow(DaemonCore::Register_Command(int, char const*, int (Service::*)(int, Stream*), char const*, Service*, DCpermission, int)+0x33)[0x80cc5d3] condor_shadow(UniShadow::init(ClassAd*, char const*, char const*)+0x77)[0x80b4747] condor_shadow(initShadow(ClassAd*)+0xd6)[0x80b04e6] condor_shadow(startShadow(ClassAd*)+0x69)[0x80b0609] condor_shadow(recycleShadow(int)+0x11d)[0x80b07ad] condor_shadow(BaseShadow::terminateJob(update_style_t)+0x16c)[0x80b234c] condor_shadow(ShadowUserPolicy::doAction(int, bool)+0x188)[0x80c4088] condor_shadow(BaseUserPolicy::checkAtExit()+0x3c)[0x814356c] condor_shadow(RemoteResource::handleSysCalls(Stream*)+0x69)[0x80b9a99] condor_shadow(DaemonCore::CallSocketHandler_worker(int, bool, Stream*)+0x834)[0x80d7624] condor_shadow(DaemonCore::CallSocketHandler_worker_demarshall(void*)+0x22)[0x80d7902] condor_shadow(CondorThreads::pool_add(void (*)(void*), void*, int*, char const*)+0x40)[0x8151800] condor_shadow(DaemonCore::CallSocketHandler(int&, bool)+0x130)[0x80cf6e0] condor_shadow(DaemonCore::Driver()+0x1f66)[0x80d1dd6] condor_shadow(main+0xd80)[0x80e3b30] /lib/libc.so.6(__libc_start_main+0xdc)[0x816e9c] condor_shadow[0x80b0091] This is reproducible with simple sleep jobs, e.g. echo "cmd=/bin/sleep\nargs=0\nnotification=never\nqueue 500" | condor_submit And recycleShadow is likely important. Notice shadow reuse - pid does not change until crash $ condor_q | tail -n1 ; ps auxwww | grep condor_shadow 154 jobs; 153 idle, 1 running, 0 held matt 17162 6.6 0.0 50200 2764 ? S 14:27 0:00 condor_shadow -f 6339.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - $ condor_q | tail -n1 ; ps auxwww | grep condor_shadow 89 jobs; 88 idle, 1 running, 0 held matt 17162 6.6 0.0 50200 2768 ? S 14:27 0:00 condor_shadow -f 6339.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - $ condor_q | tail -n1 ; ps auxwww | grep condor_shadow 13 jobs; 12 idle, 1 running, 0 held matt 17900 0.0 0.0 50200 2756 ? S 14:27 0:00 condor_shadow -f 6339.239 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - $ condor_q | tail -n1 ; ps auxwww | grep condor_shadow 0 jobs; 0 idle, 0 running, 0 held $ grep ERROR /var/log/condor/ShadowLog 06/17 14:27:16 (6339.239) (17162): ERROR "# of command handlers exceeded specified maximum" at line 893 in file daemon_core.cpp Configuring SHADOW_WORKLIFE=0 allows for successful execution of a 500 job submission without any ERRORs. Success, $ while [ 1 ] ; do condor_q | tail -n1 ; ps auxwww | grep condor_shadow | grep schedd; sleep 3; done 1500 jobs; 1500 idle, 0 running, 0 held 1500 jobs; 1500 idle, 0 running, 0 held 1500 jobs; 1500 idle, 0 running, 0 held 1500 jobs; 1500 idle, 0 running, 0 held 1500 jobs; 1500 idle, 0 running, 0 held 1489 jobs; 1488 idle, 1 running, 0 held matt 3660 5.0 0.0 49996 3348 ? S 15:02 0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1442 jobs; 1441 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3352 ? S 15:02 0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1395 jobs; 1394 idle, 1 running, 0 held matt 3660 6.1 0.0 49996 3356 ? S 15:02 0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1348 jobs; 1347 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3360 ? R 15:02 0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1301 jobs; 1300 idle, 1 running, 0 held matt 3660 5.9 0.0 49996 3364 ? S 15:02 0:00 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1254 jobs; 1253 idle, 1 running, 0 held matt 3660 6.0 0.0 49996 3372 ? S 15:02 0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1207 jobs; 1206 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3372 ? S 15:02 0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1160 jobs; 1159 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3372 ? S 15:02 0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1113 jobs; 1112 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3376 ? S 15:02 0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1067 jobs; 1066 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3384 ? S 15:02 0:01 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 1021 jobs; 1020 idle, 1 running, 0 held matt 3660 6.1 0.0 49996 3388 ? S 15:02 0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 976 jobs; 975 idle, 1 running, 0 held matt 3660 6.1 0.0 49996 3392 ? S 15:02 0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 930 jobs; 929 idle, 1 running, 0 held matt 3660 6.1 0.0 49996 3396 ? S 15:02 0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 884 jobs; 883 idle, 1 running, 0 held matt 3660 6.1 0.0 49996 3400 ? S 15:02 0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 838 jobs; 837 idle, 1 running, 0 held matt 3660 6.1 0.0 49996 3404 ? S 15:02 0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 792 jobs; 791 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3408 ? S 15:02 0:02 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 746 jobs; 745 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3412 ? S 15:02 0:03 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 700 jobs; 699 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3420 ? S 15:02 0:03 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 653 jobs; 652 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3424 ? S 15:02 0:03 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 605 jobs; 604 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3428 ? S 15:02 0:03 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 558 jobs; 557 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3432 ? S 15:02 0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 509 jobs; 508 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3436 ? S 15:02 0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 460 jobs; 459 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3440 ? S 15:02 0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 413 jobs; 412 idle, 1 running, 0 held matt 3660 6.2 0.0 49996 3448 ? S 15:02 0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 366 jobs; 365 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3452 ? S 15:02 0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 319 jobs; 318 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3456 ? S 15:02 0:04 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 272 jobs; 271 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3460 ? S 15:02 0:05 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 225 jobs; 224 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3464 ? S 15:02 0:05 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 177 jobs; 176 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3468 ? S 15:02 0:05 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 129 jobs; 128 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3476 ? S 15:02 0:05 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 81 jobs; 80 idle, 1 running, 0 held matt 3660 6.3 0.0 49996 3480 ? S 15:02 0:06 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 34 jobs; 33 idle, 1 running, 0 held matt 3660 6.4 0.0 49996 3484 ? S 15:02 0:06 condor_shadow -f 6346.0 --schedd=<127.0.0.1:52065> --xfer-queue=limit=upload,download;addr=<127.0.0.1:41464> <127.0.0.1:41464> - 0 jobs; 0 idle, 0 running, 0 held 0 jobs; 0 idle, 0 running, 0 held 0 jobs; 0 idle, 0 running, 0 held 0 jobs; 0 idle, 0 running, 0 held 0 jobs; 0 idle, 0 running, 0 held 0 jobs; 0 idle, 0 running, 0 held $ grep ERROR /var/log/condor/ShadowLog* https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1466 Built post 7.4.3-0.20 I've tested this on RHEL 5.5/4.8 x i386/x86_64 with 5000 jobs according comment #3 and it works without any exception in ShadowLog. --> VERIFIED |
Created attachment 424718 [details] log files and condor_config.local Version-Release number of selected component (if applicable): condor-7.4.3-0.17.el5 How reproducible: 7 crashes in 2 hours Steps to Reproduce: 1. set up Condor CM+Scheduler on RHEL5 and Windows execute nodes 2. submit jobs 3. wait for crash Actual results: There are 7 crashes in 2 hours. Expected results: There will be no crashes. Additional info: 06/17 04:11:27 (32.250) (6961): ERROR "# of command handlers exceeded specified maximum" at line 893 in file daemon_core.cpp Stack dump for process 6961 at timestamp 1276762287 (23 frames) condor_shadow(dprintf_dump_stack+0x44)[0x80e81d4] condor_shadow[0x80e9f34] [0x7dd420] /lib/libc.so.6(abort+0x101)[0x82b701] condor_shadow(_EXCEPT_+0x93)[0x80e80a3] condor_shadow(_ZN10DaemonCore16Register_CommandEiPKcPFiP7ServiceiP6StreamEMS2_FiiS5_ES1_S3_12DCpermissionii+0x21e)[0x80cc4ee] condor_shadow(_ZN10DaemonCore16Register_CommandEiPKcM7ServiceFiiP6StreamES1_PS2_12DCpermissioni+0x33)[0x80cc5d3] condor_shadow(_ZN9UniShadow4initEP7ClassAdPKcS3_+0x77)[0x80b4747] condor_shadow(_Z10initShadowP7ClassAd+0xd6)[0x80b04e6] condor_shadow(_Z11startShadowP7ClassAd+0x69)[0x80b0609] condor_shadow(_Z13recycleShadowi+0x11d)[0x80b07ad] condor_shadow(_ZN10BaseShadow12terminateJobE14update_style_t+0x16c)[0x80b234c] condor_shadow(_ZN16ShadowUserPolicy8doActionEib+0x188)[0x80c4088] condor_shadow(_ZN14BaseUserPolicy11checkAtExitEv+0x3c)[0x814356c] condor_shadow(_ZN14RemoteResource14handleSysCallsEP6Stream+0x69)[0x80b9a99] condor_shadow(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x834)[0x80d7624] condor_shadow(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x22)[0x80d7902] condor_shadow(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40)[0x8151800] condor_shadow(_ZN10DaemonCore17CallSocketHandlerERib+0x130)[0x80cf6e0] condor_shadow(_ZN10DaemonCore6DriverEv+0x1f66)[0x80d1dd6] condor_shadow(main+0xd80)[0x80e3b30] /lib/libc.so.6(__libc_start_main+0xdc)[0x816e9c] condor_shadow[0x80b0091]