Description of problem:

While scalability testing in the customer's environment (IT 650973) we saw a lot of 108 exits for jobs - roughly 20% of jobs attempting to execute exited 108. The scenario we walked through for this was:

* Schedd gets a claim to a partitioned slot and runs a job in it.
* Schedd attempts to re-use the claimed slot to run an additional job.
* The additional job requires more memory than is available on the system.
* The startd rejects the job (108 exit).

These jobs with differing memory requests should go back through a negotiation cycle or be bucketed into identically-sized claims. This slows our ability to saturate our pool and causes variance in job execution times (e.g. 80% of a group of jobs will hit available slots sooner than the other 20% each cycle, with a few stragglers waiting behind all others and blocking job group completion).

How reproducible: 100%

Steps to Reproduce: Submit jobs with differing memory requirements.
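To gauge how widespread the failure is, a rough check (assuming the default RPM log location /var/log/condor/ShadowLog used later in this report) is to compare the count of 108 shadow exits against all shadow exits:

$ grep -c "EXITING WITH STATUS 108" /var/log/condor/ShadowLog
$ grep -c "EXITING WITH STATUS" /var/log/condor/ShadowLog

The ratio of the two counts should track the roughly 20% failure rate described above.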
Reproduce using:

cmd = /bin/sleep
args = 45
request_memory = 10
notification = never
queue
request_memory = 20
queue
request_memory = 30
queue

After the cluster runs you can see excess Shadow starts:

$ condor_history -l | grep NumShadowStarts
NumShadowStarts = 1
NumShadowStarts = 2
NumShadowStarts = 2

Also, in the logs (with D_FULLDEBUG) you will see:

SchedLog (note the 108 exit code) -

Job 156.1: is runnable
ConcurrencyLimits match, can reuse claim
Scheduler::start_std - job=156.1 on <127.0.0.1:43575>
Queueing job 156.1 in runnable job queue
start next job after 0 sec, JobsThisBurst 0
Match (robin.local <127.0.0.1:43575>#1270037753#59#... for matt) - running 156.1
...
Job cleanup for 156.0 will not block, calling jobIsFinished() directly
...
Job prep for 156.1 will not block, calling aboutToSpawnJobHandler() directly
aboutToSpawnJobHandler() completed for job 156.1, attempting to spawn job handler
...
Shadow pid 32426 for job 156.1 exited with status 108
Called send_vacate( <127.0.0.1:43575>, 443 ) via UDP
Completed RELEASE_CLAIM to startd at <127.0.0.1:43575>
Match record (robin.local <127.0.0.1:43575>#1270037753#59#... for matt, 156.1) deleted
Deleting shadow rec for PID 32426, job (156.1)
Entered check_zombie( 32426, 0x0x96fb0a4, st=2 )
Marked job 156.1 as IDLE

ShadowLog (note the 108 exit code) -

$ grep 156.1 /var/log/condor/ShadowLog
Initializing a VANILLA shadow for job 156.1
(156.1) (32426): Request to run on robin.local <127.0.0.1:43575> was REFUSED
(156.1) (32426): Job 156.1 is being evicted from robin.local
(156.1) (32426): logEvictEvent with unknown reason (108), aborting
(156.1) (32426): **** condor_shadow (condor_SHADOW) pid 32426 EXITING WITH STATUS 108
Initializing a VANILLA shadow for job 156.1
(156.1) (32429): Request to run on robin.local <127.0.0.1:43575> was ACCEPTED
(156.1) (32429): Job 156.1 terminated: exited with status 0
(156.1) (32429): **** condor_shadow (condor_SHADOW) pid 32429 EXITING WITH STATUS 100

StartLog (note "Machine Requirements check failed!") -

Called deactivate_claim_forcibly()
Starter pid 32402 exited with status 0
State change: starter exited
Changing activity: Busy -> Idle
Got activate_claim request from shadow (<127.0.0.1:34712>)
Machine Requirements check failed!
Called deactivate_claim_forcibly()
State change: received RELEASE_CLAIM command
Changing state and activity: Claimed/Idle -> Preempting/Vacating
State change: No preempting claim, returning to owner
Changing state and activity: Preempting/Vacating -> Owner/Idle
State change: IS_OWNER is false
Changing state: Owner -> Unclaimed
Changing state: Unclaimed -> Delete
Resource no longer needed, deleting
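If the scheduler- and startd-side messages above are not visible, the daemons are probably not logging at full debug. A minimal config sketch (these are the standard Condor debug knobs, not anything specific to this bug):

SCHEDD_DEBUG = D_FULLDEBUG
SHADOW_DEBUG = D_FULLDEBUG
STARTD_DEBUG = D_FULLDEBUG

With that in place, the startd-side rejection can be confirmed directly (assuming the default RPM log location used elsewhere in this report):

$ grep "Machine Requirements check failed" /var/log/condor/StartLog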
Verify by not seeing the 108 errors or the requirements failure, and by showing that a reversed set of jobs does use the same claim. With no 108 errors, the Schedd should give up on the claim and go back to the Negotiator for each job. To show that the reversed set of jobs uses a single claim, use:

cmd = /bin/sleep
args = 45
request_memory = 30
notification = never
queue
request_memory = 20
queue
request_memory = 10
queue

and observe that the slot created for the first job, with memory of 30, is used for all the other jobs; use condor_status or watch the SchedLog to see the claim being reused.
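One way to confirm the reversed set really shares a single claim, using only commands already shown in this report (and assuming the default log locations):

$ condor_history -l | grep NumShadowStarts
(expect NumShadowStarts = 1 for every job in the cluster)

$ grep "reuse claim" /var/log/condor/SchedLog
(expect "ConcurrencyLimits match, can reuse claim" for the second and third jobs, with no 108 shadow exit following it)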
Fixed upstream and built in 7.4.3-0.9.
Reproduction attempt on condor-7.4.1-0.7.1.el5:

$CondorVersion: 7.4.1 Dec 11 2009 BuildID: RH-7.4.1-0.7.1.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

# grep EXIT /var/log/condor/ShadowLog | tail
04/21 11:53:06 (102.19) (5329): **** condor_shadow (condor_SHADOW) pid 5329 EXITING WITH STATUS 112
04/21 11:53:06 (102.20) (5336): **** condor_shadow (condor_SHADOW) pid 5336 EXITING WITH STATUS 112
04/21 11:53:07 (102.21) (5342): **** condor_shadow (condor_SHADOW) pid 5342 EXITING WITH STATUS 112
04/21 11:53:07 (102.22) (5347): **** condor_shadow (condor_SHADOW) pid 5347 EXITING WITH STATUS 112
04/21 11:53:08 (102.23) (5349): **** condor_shadow (condor_SHADOW) pid 5349 EXITING WITH STATUS 112
04/21 11:53:08 (102.24) (5352): **** condor_shadow (condor_SHADOW) pid 5352 EXITING WITH STATUS 112
04/21 11:53:08 (102.25) (5356): **** condor_shadow (condor_SHADOW) pid 5356 EXITING WITH STATUS 112
04/26 15:25:33 (103.0) (2769): **** condor_shadow (condor_SHADOW) pid 2769 EXITING WITH STATUS 100
04/26 15:26:19 (103.1) (2783): **** condor_shadow (condor_SHADOW) pid 2783 EXITING WITH STATUS 100
04/26 15:27:04 (103.2) (2794): **** condor_shadow (condor_SHADOW) pid 2794 EXITING WITH STATUS 100

# grep 103. /var/log/condor/ShadowLog
04/26 15:24:48 Initializing a VANILLA shadow for job 103.0
04/26 15:24:48 (103.0) (2769): Request to run on condor1.lan <192.168.122.11:52905> was ACCEPTED
04/26 15:25:33 (103.0) (2769): Job 103.0 terminated: exited with status 0
04/26 15:25:33 (103.0) (2769): **** condor_shadow (condor_SHADOW) pid 2769 EXITING WITH STATUS 100
04/26 15:25:34 Initializing a VANILLA shadow for job 103.1
04/26 15:25:34 (103.1) (2783): Request to run on condor1.lan <192.168.122.11:52905> was ACCEPTED
04/26 15:26:19 (103.1) (2783): Job 103.1 terminated: exited with status 0
04/26 15:26:19 (103.1) (2783): **** condor_shadow (condor_SHADOW) pid 2783 EXITING WITH STATUS 100
04/26 15:26:19 Initializing a VANILLA shadow for job 103.2
04/26 15:26:19 (103.2) (2794): Request to run on condor1.lan <192.168.122.11:52905> was ACCEPTED
04/26 15:27:04 (103.2) (2794): Job 103.2 terminated: exited with status 0
04/26 15:27:04 (103.2) (2794): **** condor_shadow (condor_SHADOW) pid 2794 EXITING WITH STATUS 100
As you can see in the previous comment, I'm not able to reproduce this issue on condor-7.4.1-0.7.1.el5. All jobs were executed and there is no 108 error in the ShadowLog. The "Machine Requirements check failed" message is missing from the StartLog, too.

My condor config:

CONDOR_DEVELOPERS = NONE
CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = Personal Condor
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
NEGOTIATOR_INTERVAL = 20
# A local config dir to hold additional configuration
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config
TRUST_UID_DOMAIN = TRUE
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_CPUS = 250
NUM_SLOTS_TYPE_1 = 1

Maybe some special load or configuration is needed for reproduction that is not included in this bug. Can you please find the exact version of condor where the bug can be reproduced? Could you also include the condor configuration used?
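As a sanity check that this config actually advertises a single partitionable slot (a sketch; PartitionableSlot and Cpus are the standard machine ad attribute names, and the exact slot name may differ):

$ condor_status -long | grep -E '^(Name|PartitionableSlot|Cpus) ='

Before any jobs run, this should show one slot with PartitionableSlot = TRUE and Cpus = 250.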
Does this reproduce with 7.4.3-0.8?
Created attachment 410050 [details]
condor logs

See the attached log files from my testing. They are from condor 7.4.3-0.11. The problem was reproduced on x86 and x86_64 on RHEL4 and RHEL5. The problem occurred in less than 10% of cases. The test was executed 500 times on each system to be sure. I used the condor configuration from my previous comment and ran these jobs on it:

echo -e "cmd=/bin/sleep\nargs=1\nrequest_memory=10\nnotification=never\nqueue\n"\
"request_memory=20\nqueue\nrequest_memory=30\nqueue" | condor_submit

The error "Machine Requirements check failed" was found about 12 times out of 500 tests. Please, could you take a look at the logs and check whether the problem could persist in the 7.4.3-0.11 version of condor (the already patched one)?
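For reference, a rough shell sketch of the repeated run described above (the submit one-liner, the 500-iteration count, and the StartLog path come from this report; the fixed sleep between submissions is an assumption sized for the 20-second NEGOTIATOR_INTERVAL and the 1-second jobs):

#!/bin/bash
# Submit the three-job cluster 500 times, waiting between runs.
for i in $(seq 1 500); do
    echo -e "cmd=/bin/sleep\nargs=1\nrequest_memory=10\nnotification=never\nqueue\n"\
    "request_memory=20\nqueue\nrequest_memory=30\nqueue" | condor_submit
    sleep 60   # crude wait for the short jobs to finish; adjust as needed
done
# Count how many runs hit the startd-side rejection.
grep -c "Machine Requirements check failed" /var/log/condor/StartLog

A non-zero count indicates the problem is still present; the comment above reports roughly 12 hits out of 500 runs on 7.4.3-0.11.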
Tested on the current version, condor-7.4.3-0.16, for all combinations of RHEL4/5 and i386/x86_64.

echo -e "cmd=/bin/sleep\nargs=1\nrequest_memory=10\nnotification=never\nqueue\n"\
"request_memory=20\nqueue\nrequest_memory=30\nqueue" | condor_submit

# condor_history -l | grep NumShadowStart
NumShadowStarts = 1

No "Machine Requirements check failed" error or 108 exit messages were found.

>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, when the scheduler daemon attempted to reuse a claim for jobs with differing memory requests, the startd rejected a large number of jobs (108 exits), which significantly slowed the ability to saturate the pool and caused variance in job execution times. With this update, a claim is reused only when it can satisfy the job's memory request (for example, a reversed set of jobs with decreasing memory requests now runs on the same claim), and the issue is resolved.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html