Bug 699723 - [RFE] Make shadow worklife extend past CLAIM_WORKLIFE
Summary: [RFE] Make shadow worklife extend past CLAIM_WORKLIFE
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.1
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks: 699726
Reported: 2011-04-26 13:24 UTC by Matthew Farrellee
Modified: 2011-08-31 12:26 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-03 18:21:43 UTC
Target Upstream Version:
Embargoed:


Description Matthew Farrellee 2011-04-26 13:24:21 UTC
CLAIM_WORKLIFE specifies how long a claim can be reused.
SHADOW_WORKLIFE specifies how long a shadow can be reused.

Great scale and performance gains come from SHADOW_WORKLIFE. It eliminates the need for the Schedd to constantly spawn and reap shadows. Instead, the Schedd can pass new jobs to an existing shadow.

In 7.6.1-0.2 and earlier, the Schedd only passes a shadow jobs that fit the claim the shadow was spawned with. This means that when the claim expires, the shadow appears to have no more work and exits. Thus CLAIM_WORKLIFE bounds the worklife of the shadow.
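
To make the bound concrete, here is a minimal sketch of the behavior described above. It is a toy model, not Condor code: the function name is hypothetical, and it assumes a worklife only prevents new jobs from starting (a running job is allowed to finish) and that the shadow and the claim start at the same time.

```python
from typing import List


def jobs_run_by_one_shadow(job_durations: List[int],
                           shadow_worklife: int,
                           claim_worklife: int) -> int:
    """Count how many back-to-back jobs one shadow runs before it exits.

    Toy model (assumption): a new job may start only while the elapsed
    time is below both worklife limits, all in seconds.
    """
    elapsed = 0
    count = 0
    for duration in job_durations:
        # The shadow only gets jobs for its original claim, so an expired
        # claim ends the shadow just as an expired shadow worklife would.
        if elapsed >= shadow_worklife or elapsed >= claim_worklife:
            break
        elapsed += duration
        count += 1
    return count


# Illustrative values only: ten 600-second jobs.
short_claim = jobs_run_by_one_shadow([600] * 10, shadow_worklife=3600, claim_worklife=1200)
long_claim = jobs_run_by_one_shadow([600] * 10, shadow_worklife=3600, claim_worklife=3600)
```

With these illustrative numbers, the shadow stops after the claim's limit is reached even though its own worklife would allow more jobs, which is the effect this RFE asks to remove.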

Scheduler::RecycleShadow() could search for a new claim to give to an existing shadow.
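
A rough sketch of the idea, written in Python rather than as the real C++ method. Every name here (Claim, recycle_shadow, held_claims) is hypothetical, and it assumes the Schedd already holds other idle, unexpired claims it could hand over; it does not model requesting a brand-new claim.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    claim_id: str
    age: int            # seconds since the claim was first used
    worklife: int       # CLAIM_WORKLIFE for this claim, in seconds
    busy: bool = False  # True while another shadow is using the claim

    def expired(self) -> bool:
        return self.age >= self.worklife


def recycle_shadow(current: Claim, held_claims: List[Claim]) -> Optional[Claim]:
    """Pick the claim a finished shadow should use next, or None to let it exit."""
    # Behavior described in this bug: keep using the original claim while it lasts.
    if not current.expired():
        return current
    # Proposed extension (assumption): look for another usable claim
    # instead of letting the shadow exit.
    for claim in held_claims:
        if not claim.busy and not claim.expired():
            return claim
    return None
```

For example, a shadow whose claim is past its worklife would be handed the first idle, unexpired claim in held_claims, and would exit only if none exists.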

Comment 1 Timothy St. Clair 2011-06-13 16:28:38 UTC
Now tracking upstream:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2229

Comment 3 Timothy St. Clair 2011-07-12 19:35:10 UTC
Dev notes: 

The original patch would not work because of the sequence of events the schedd follows when starting a claim. If CLAIM_WORKLIFE expires, the schedd has no way to block while requesting a new claim for a given match_rec. Presently, it enqueues a claim_request for a given match, which then waits for a response. Once the response has been received, it triggers another timer to start the job. Everything is 1:1 as it comes in from negotiation. I will consult with upstream, but it appears that doing this would require rearchitecting how we handle claim_request and how jobs are spun up.
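
The sequence above can be pictured with a small event-loop sketch. This is an illustration only, not the schedd's code: MiniSchedd, its handlers, and the event names are hypothetical stand-ins for the match, claim_request, and start-job timer steps.

```python
from collections import deque


class MiniSchedd:
    """Toy event loop: each match goes through three separate, non-blocking steps."""

    def __init__(self):
        self.events = deque()
        self.log = []

    def on_match(self, match_id):
        # Negotiation handed us a match: enqueue exactly one claim request for it.
        self.log.append(f"claim_request queued for {match_id}")
        self.events.append(("claim_response", match_id))

    def on_claim_response(self, match_id):
        # The response arrives later as its own event; nothing waited on it.
        self.log.append(f"claim granted for {match_id}")
        self.events.append(("start_job_timer", match_id))

    def on_start_job_timer(self, match_id):
        # Only now does a job (and its shadow) get started for this match.
        self.log.append(f"job started on {match_id}")

    def run(self):
        handlers = {
            "claim_response": self.on_claim_response,
            "start_job_timer": self.on_start_job_timer,
        }
        while self.events:
            kind, match_id = self.events.popleft()
            handlers[kind](match_id)


schedd = MiniSchedd()
schedd.on_match("match-1")
schedd.run()
```

In this model there is no call that returns a new claim synchronously, so a shadow whose claim has expired would have nothing to wait on while the three steps play out for some other match.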

It does, however, seem possible to reuse the shadow for other transient error conditions (107 & 108).

Comment 4 Timothy St. Clair 2011-07-12 20:29:57 UTC
Dev notes:

Consulted with DanB, and he agrees with the assessment. There is little to be gained from trying to recycle a shadow once a claim has expired, because you would need to request a new claim, which is a high-latency event.

Will pursue the option of reusing the shadow for other error conditions.

