Description of problem:

While scalability testing in the customer's environment (IT 650973) we saw a lot of 108 exits for jobs - roughly 20% of jobs attempting to execute exited 108. The scenario we walked through for this was:

* Schedd gets a claim to a partitioned slot and runs a job in it.
* Schedd attempts to re-use the claimed slot to run an additional job.
* The additional job requires more memory than is available on the system.
* The startd rejects the job (108 exit).

These jobs with differing memory requests should go back through a negotiation cycle or be bucketed into identically-sized claims. This slows our ability to saturate our pool and causes variance in job execution times (e.g. 80% of a group of jobs will hit available slots sooner than the other 20% each cycle, with a few stragglers waiting behind all others and blocking job group completion).

How reproducible: 100%

Steps to Reproduce: Submit jobs with differing memory requirements.
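To gauge how widespread the failure is, a rough check (assuming the default RPM log location /var/log/condor/ShadowLog used later in this report) is to compare the count of 108 shadow exits against all shadow exits:

$ grep -c "EXITING WITH STATUS 108" /var/log/condor/ShadowLog
$ grep -c "EXITING WITH STATUS" /var/log/condor/ShadowLog

The ratio of the two counts should track the roughly 20% failure rate described above.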
Reproduce using:

cmd = /bin/sleep
args = 45
request_memory = 10
notification = never
queue
request_memory = 20
queue
request_memory = 30
queue

After the cluster runs you can see excess Shadow starts:

$ condor_history -l | grep NumShadowStarts
NumShadowStarts = 1
NumShadowStarts = 2
NumShadowStarts = 2

Also, in the logs (with D_FULLDEBUG) you will see:

SchedLog (note the 108 exit code) -

Job 156.1: is runnable
ConcurrencyLimits match, can reuse claim
Scheduler::start_std - job=156.1 on <127.0.0.1:43575>
Queueing job 156.1 in runnable job queue
start next job after 0 sec, JobsThisBurst 0
Match (robin.local <127.0.0.1:43575>#1270037753#59#... for matt) - running 156.1
...
Job cleanup for 156.0 will not block, calling jobIsFinished() directly
...
Job prep for 156.1 will not block, calling aboutToSpawnJobHandler() directly
aboutToSpawnJobHandler() completed for job 156.1, attempting to spawn job handler
...
Shadow pid 32426 for job 156.1 exited with status 108
Called send_vacate( <127.0.0.1:43575>, 443 ) via UDP
Completed RELEASE_CLAIM to startd at <127.0.0.1:43575>
Match record (robin.local <127.0.0.1:43575>#1270037753#59#... for matt, 156.1) deleted
Deleting shadow rec for PID 32426, job (156.1)
Entered check_zombie( 32426, 0x0x96fb0a4, st=2 )
Marked job 156.1 as IDLE

ShadowLog (note the 108 exit code) -

$ grep 156.1 /var/log/condor/ShadowLog
Initializing a VANILLA shadow for job 156.1
(156.1) (32426): Request to run on robin.local <127.0.0.1:43575> was REFUSED
(156.1) (32426): Job 156.1 is being evicted from robin.local
(156.1) (32426): logEvictEvent with unknown reason (108), aborting
(156.1) (32426): **** condor_shadow (condor_SHADOW) pid 32426 EXITING WITH STATUS 108
Initializing a VANILLA shadow for job 156.1
(156.1) (32429): Request to run on robin.local <127.0.0.1:43575> was ACCEPTED
(156.1) (32429): Job 156.1 terminated: exited with status 0
(156.1) (32429): **** condor_shadow (condor_SHADOW) pid 32429 EXITING WITH STATUS 100

StartLog (note "Machine Requirements check failed!") -

Called deactivate_claim_forcibly()
Starter pid 32402 exited with status 0
State change: starter exited
Changing activity: Busy -> Idle
Got activate_claim request from shadow (<127.0.0.1:34712>)
Machine Requirements check failed!
Called deactivate_claim_forcibly()
State change: received RELEASE_CLAIM command
Changing state and activity: Claimed/Idle -> Preempting/Vacating
State change: No preempting claim, returning to owner
Changing state and activity: Preempting/Vacating -> Owner/Idle
State change: IS_OWNER is false
Changing state: Owner -> Unclaimed
Changing state: Unclaimed -> Delete
Resource no longer needed, deleting
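If the scheduler- and startd-side messages above are not visible, the daemons are probably not logging at full debug. A minimal config sketch (these are the standard Condor debug knobs, not anything specific to this bug):

SCHEDD_DEBUG = D_FULLDEBUG
SHADOW_DEBUG = D_FULLDEBUG
STARTD_DEBUG = D_FULLDEBUG

With that in place, the startd-side rejection can be confirmed directly (assuming the default RPM log location used elsewhere in this report):

$ grep "Machine Requirements check failed" /var/log/condor/StartLog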
Verify by not seeing the 108 errors or the requirements failure, and by showing that a reversed set of jobs does use the same claim. With no 108 errors, the Schedd should give up on the claim and go back to the Negotiator for each job. To show that the reversed set of jobs uses a single claim, use:

cmd = /bin/sleep
args = 45
request_memory = 30
notification = never
queue
request_memory = 20
queue
request_memory = 10
queue

and observe that the slot created for the first job, with memory of 30, is used for all the other jobs; use condor_status or watch the SchedLog to see the claim being reused.
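One way to confirm the reversed set really shares a single claim, using only commands already shown in this report (and assuming the default log locations):

$ condor_history -l | grep NumShadowStarts
(expect NumShadowStarts = 1 for every job in the cluster)

$ grep "reuse claim" /var/log/condor/SchedLog
(expect "ConcurrencyLimits match, can reuse claim" for the second and third jobs, with no 108 shadow exit following it)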
Fixed upstream and built in 7.4.3-0.9.
Reproduction attempt on condor-7.4.1-0.7.1.el5:

$CondorVersion: 7.4.1 Dec 11 2009 BuildID: RH-7.4.1-0.7.1.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

# grep EXIT /var/log/condor/ShadowLog | tail
04/21 11:53:06 (102.19) (5329): **** condor_shadow (condor_SHADOW) pid 5329 EXITING WITH STATUS 112
04/21 11:53:06 (102.20) (5336): **** condor_shadow (condor_SHADOW) pid 5336 EXITING WITH STATUS 112
04/21 11:53:07 (102.21) (5342): **** condor_shadow (condor_SHADOW) pid 5342 EXITING WITH STATUS 112
04/21 11:53:07 (102.22) (5347): **** condor_shadow (condor_SHADOW) pid 5347 EXITING WITH STATUS 112
04/21 11:53:08 (102.23) (5349): **** condor_shadow (condor_SHADOW) pid 5349 EXITING WITH STATUS 112
04/21 11:53:08 (102.24) (5352): **** condor_shadow (condor_SHADOW) pid 5352 EXITING WITH STATUS 112
04/21 11:53:08 (102.25) (5356): **** condor_shadow (condor_SHADOW) pid 5356 EXITING WITH STATUS 112
04/26 15:25:33 (103.0) (2769): **** condor_shadow (condor_SHADOW) pid 2769 EXITING WITH STATUS 100
04/26 15:26:19 (103.1) (2783): **** condor_shadow (condor_SHADOW) pid 2783 EXITING WITH STATUS 100
04/26 15:27:04 (103.2) (2794): **** condor_shadow (condor_SHADOW) pid 2794 EXITING WITH STATUS 100

# grep 103. /var/log/condor/ShadowLog
04/26 15:24:48 Initializing a VANILLA shadow for job 103.0
04/26 15:24:48 (103.0) (2769): Request to run on condor1.lan <192.168.122.11:52905> was ACCEPTED
04/26 15:25:33 (103.0) (2769): Job 103.0 terminated: exited with status 0
04/26 15:25:33 (103.0) (2769): **** condor_shadow (condor_SHADOW) pid 2769 EXITING WITH STATUS 100
04/26 15:25:34 Initializing a VANILLA shadow for job 103.1
04/26 15:25:34 (103.1) (2783): Request to run on condor1.lan <192.168.122.11:52905> was ACCEPTED
04/26 15:26:19 (103.1) (2783): Job 103.1 terminated: exited with status 0
04/26 15:26:19 (103.1) (2783): **** condor_shadow (condor_SHADOW) pid 2783 EXITING WITH STATUS 100
04/26 15:26:19 Initializing a VANILLA shadow for job 103.2
04/26 15:26:19 (103.2) (2794): Request to run on condor1.lan <192.168.122.11:52905> was ACCEPTED
04/26 15:27:04 (103.2) (2794): Job 103.2 terminated: exited with status 0
04/26 15:27:04 (103.2) (2794): **** condor_shadow (condor_SHADOW) pid 2794 EXITING WITH STATUS 100
As you can see in the previous comment, I'm not able to reproduce this issue on condor-7.4.1-0.7.1.el5. All jobs were executed and there is no 108 error in the ShadowLog. The "Machine Requirements check failed" message is missing from the StartLog, too.

My condor config:

CONDOR_DEVELOPERS = NONE
CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = Personal Condor
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
NEGOTIATOR_INTERVAL = 20
# A local config dir to hold additional configuration
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config
TRUST_UID_DOMAIN = TRUE
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_CPUS = 250
NUM_SLOTS_TYPE_1 = 1

Maybe some special load or configuration is needed for reproduction that is not included in this bug. Can you please find the exact version of condor where the bug can be reproduced? Could you also include the condor configuration used?
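As a sanity check that this config actually advertises a single partitionable slot (a sketch; PartitionableSlot and Cpus are the standard machine ad attribute names, and the exact slot name may differ):

$ condor_status -long | grep -E '^(Name|PartitionableSlot|Cpus) ='

Before any jobs run, this should show one slot with PartitionableSlot = TRUE and Cpus = 250.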
Does this reproduce with 7.4.3-0.8?
Created attachment 410050 [details]
condor logs

See the attached log files from my testing. They are from condor 7.4.3-0.11. The problem was reproduced on x86 and x86_64 on RHEL4 and RHEL5. The problem occurred in less than 10% of cases. The test was executed 500 times on each system to be sure. I used the condor configuration from my previous comment and ran these jobs on it:

echo -e "cmd=/bin/sleep\nargs=1\nrequest_memory=10\nnotification=never\nqueue\n"\
"request_memory=20\nqueue\nrequest_memory=30\nqueue" | condor_submit

The error "Machine Requirements check failed" was found about 12 times out of 500 tests. Please, could you take a look at the logs and check whether the problem could persist in the 7.4.3-0.11 version of condor (the already patched one)?
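For reference, a rough shell sketch of the repeated run described above (the submit one-liner, the 500-iteration count, and the StartLog path come from this report; the fixed sleep between submissions is an assumption sized for the 20-second NEGOTIATOR_INTERVAL and the 1-second jobs):

#!/bin/bash
# Submit the three-job cluster 500 times, waiting between runs.
for i in $(seq 1 500); do
    echo -e "cmd=/bin/sleep\nargs=1\nrequest_memory=10\nnotification=never\nqueue\n"\
    "request_memory=20\nqueue\nrequest_memory=30\nqueue" | condor_submit
    sleep 60   # crude wait for the short jobs to finish; adjust as needed
done
# Count how many runs hit the startd-side rejection.
grep -c "Machine Requirements check failed" /var/log/condor/StartLog

A non-zero count indicates the problem is still present; the comment above reports roughly 12 hits out of 500 runs on 7.4.3-0.11.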
Tested on the current version, condor-7.4.3-0.16, for all combinations of RHEL4/5 and i386/x86_64.

echo -e "cmd=/bin/sleep\nargs=1\nrequest_memory=10\nnotification=never\nqueue\n"\
"request_memory=20\nqueue\nrequest_memory=30\nqueue" | condor_submit

# condor_history -l | grep NumShadowStart
NumShadowStarts = 1

No "Machine Requirements check failed" error or 108 exit messages were found.

>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, when the scheduler daemon attempted to reuse a claim for jobs with differing memory requests, the startd rejected a large number of jobs (108 exits), which significantly slowed the ability to saturate the pool and caused variance in job execution times. With this update, a claim is reused only when it can satisfy the job's memory request (for example, a reversed set of jobs with decreasing memory requests now runs on the same claim), and the issue is resolved.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html