Bug 523482

Summary: Slots left Claimed Idle when there are no jobs
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Jan Sarenik <jsarenik>
Severity: medium
Docs Contact:
Priority: low
Version: 1.1
CC: iboverma, jsarenik, lbrindle
Target Milestone: 1.2
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix

C: A failure on a submit node when starting a job
C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
F: The condor_schedd now releases slots on failure to execute the condor_shadow.
R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.

If a submit node failed to start a job correctly, the condor_schedd would claim a slot and then never release it. The condor_schedd has now been altered so that it releases slots on failure to execute the condor_shadow. The slot will now return to Unclaimed/Idle instead of remaining in Claimed/Idle.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-12-03 09:19:37 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 527551    
Attachments:
Description Flags
Simple verification script none

Description Matthew Farrellee 2009-09-15 16:46:17 UTC
Description of problem:

It is possible for slots to be left in the Claimed state, with no running jobs and no jobs left to run, until the claim lease expires.


Version-Release number of selected component (if applicable):

condor-7.4.0-0.4


How reproducible:

100%


Steps to Reproduce:
1. start condor (make sure the schedd starts)
2. mv /usr/sbin/condor_shadow /usr/sbin/condor_shadow-
3. submit a job (echo -e 'executable=/bin/sleep\narguments=5\ntransfer_executable=false\nnotification=never\nqueue 1\n' | condor_submit)
4. watch condor_status -total until there is a Claimed slot; you can also check condor_status | grep Claimed, where the slot should show as Claimed Idle
5. remove the job (condor_rm -a will do it)
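
The check in step 4 can be scripted. The following is a minimal sketch, not the attached verification script: it assumes the default condor_status column order (Name, OpSys, Arch, State, Activity, ...), and the helper name claimed_idle_slots and the sample text are made up for illustration rather than captured from a real pool.

```python
def claimed_idle_slots(status_output):
    """Return the names of slots whose State/Activity columns read Claimed/Idle."""
    slots = []
    for line in status_output.splitlines():
        fields = line.split()
        # Assumed column order: Name OpSys Arch State Activity ...
        if len(fields) >= 5 and fields[3] == "Claimed" and fields[4] == "Idle":
            slots.append(fields[0])
    return slots


# Illustrative sample text (hypothetical, not real pool output).
sample = """\
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node.example LINUX      X86_64 Claimed   Idle     0.000  1024  0+00:00:04
slot2@node.example LINUX      X86_64 Unclaimed Idle     0.000  1024  0+00:10:00
"""

print(claimed_idle_slots(sample))  # ['slot1@node.example']
```

In a real run the text would come from the output of condor_status on the pool under test; if the column layout differs, the field indexes need adjusting.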

  
Actual results:

The Claimed slot will remain that way until the claim lease expires. This can be verified by watching the StartLog on the slot's machine.


Expected results:

The slot should be returned to the Unclaimed state.


Additional info:

Running a job on a slot is a multi-step process. Three of the steps the Schedd goes through are: 1) REQUEST_CLAIM on the Startd, 2) start a Shadow for the job, 3) let the Shadow ACTIVATE_CLAIM on the Startd. The Schedd will RELEASE_CLAIM on the Startd for all jobs that have an active Shadow. Anything that prevents step (2) from succeeding will demonstrate this bug, such as the Shadow failing to execute (induced above) or a condor_rm -a issued while the Schedd is loaded, which cancels Shadow execution (the original way of hitting this bug).

The Schedd keeps a match_rec for each job it is starting. The match_rec has a status field that can indicate M_STARTD_CONTACT_LIMBO (step 1 completed) or M_CLAIMED (step 2 completed). The Schedd sends RELEASE_CLAIM on M_CLAIMED, and should be changed to send it on M_STARTD_CONTACT_LIMBO as well.
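
To make the proposed change concrete, here is a toy model of the release decision. It is only a sketch of the logic described above, not Schedd code: the two status names come from this report, while MatchStatus, sends_release_claim and the two sets are hypothetical names invented for the example.

```python
from enum import Enum


class MatchStatus(Enum):
    # Only the two states named in this report are modelled.
    M_STARTD_CONTACT_LIMBO = "step 1 completed"
    M_CLAIMED = "step 2 completed"


# States in which the (modelled) Schedd sends RELEASE_CLAIM.
RELEASE_BEFORE_FIX = {MatchStatus.M_CLAIMED}
RELEASE_AFTER_FIX = {MatchStatus.M_CLAIMED, MatchStatus.M_STARTD_CONTACT_LIMBO}


def sends_release_claim(status, release_states):
    """Return True if a match_rec in this status would get a RELEASE_CLAIM."""
    return status in release_states


# A job whose Shadow never started stays at M_STARTD_CONTACT_LIMBO.
stuck = MatchStatus.M_STARTD_CONTACT_LIMBO
print(sends_release_claim(stuck, RELEASE_BEFORE_FIX))  # False: slot stays Claimed
print(sends_release_claim(stuck, RELEASE_AFTER_FIX))   # True: slot is released
```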

Comment 2 Matthew Farrellee 2009-09-15 17:51:48 UTC
Resolved upstream in V7_4-branch, will be in 7.4.0-0.5

http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=742

Comment 4 Jan Sarenik 2009-10-29 10:14:46 UTC
On condor-7.4.1-0.2.el5 the slot is still not returned to the Unclaimed state,
the same as on condor-7.4.0-0.4.el5.

Comment 5 Irina Boverman 2009-10-29 14:30:01 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
please see bug summary.

Comment 6 Matthew Farrellee 2009-10-30 12:01:17 UTC
Deleted Release Notes Contents.

Old Contents:
please see bug summary.

Comment 7 Matthew Farrellee 2009-11-03 16:18:03 UTC
After some further investigation, the amount of time until the Schedd sends a RELEASE_CLAIM is controlled by SCHEDD_INTERVAL (defaults to 300 seconds). I believe this is working as expected.
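
For anyone verifying this, a check has to wait at least one SCHEDD_INTERVAL before concluding that the slot is stuck. A minimal sketch of such a wait, assuming a caller-supplied is_still_claimed check; wait_for_release, slack and poll_every are hypothetical names and values, and the clock is faked below so the example finishes instantly.

```python
import time


def wait_for_release(is_still_claimed, schedd_interval=300, slack=60,
                     poll_every=10, clock=time.monotonic, sleep=time.sleep):
    """Poll until the slot is no longer Claimed, or until
    schedd_interval + slack seconds have passed. Returns True if released."""
    deadline = clock() + schedd_interval + slack
    while True:
        if not is_still_claimed():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_every)


# Simulated run with a fake clock: pretend the slot is released after 120 s.
now = [0.0]
released = wait_for_release(
    is_still_claimed=lambda: now[0] < 120,
    clock=lambda: now[0],
    sleep=lambda seconds: now.__setitem__(0, now[0] + seconds),
)
print(released)  # True
```

Against a real pool, is_still_claimed would inspect condor_status output and the default clock and sleep would be used.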

Comment 8 Jan Sarenik 2009-11-03 17:25:22 UTC
Created attachment 367329 [details]
Simple verification script

Comment 9 Jan Sarenik 2009-11-03 17:32:08 UTC
condor-7.4.1-0.2.el5
condor-7.4.1-0.2.el4

VERIFIED

Comment 10 Irina Boverman 2009-11-09 19:19:05 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
see summary

Comment 11 Lana Brindley 2009-11-11 20:33:40 UTC
(In reply to comment #7)
> After some further investigation, the amount of time until the Schedd sends a
> RELEASE_CLAIM is controlled by SCHEDD_INTERVAL (defaults to 300 seconds). I
> believe this is working as expected.  

So ... no release note required?

LKB

Comment 12 Matthew Farrellee 2009-11-11 21:30:02 UTC
Still need relnote...

C: A failure on a submit node when starting a job can result in a slot remaining in the Claimed/Idle state/activity without a job until the claim lease expires.
C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
F: The condor_schedd now releases slots on failure to execute the condor_shadow.
R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.

Comment 13 Jan Sarenik 2009-11-12 09:59:56 UTC
(In reply to comment #11)
> So ... no release note required?

Lana, as far as I understand, that was a reply to my confusion
that the slots did not get freed as quickly as I expected.
I then used the mentioned variable to see them being
freed more quickly, which did not happen on the buggy
version.

Comment 14 Jan Sarenik 2009-11-12 10:00:40 UTC
(In reply to comment #13)
> Lana, as far as I understand that was reply to my confusion

_that_ stands for comment #7 of course.

Comment 15 Lana Brindley 2009-11-19 00:40:46 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-see summary
+Grid bug fix
+
+C: A failure on a submit node when starting a job
+C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
+F: The condor_schedd now releases slots on failure to execute the condor_shadow.
+R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle. 
+
+If a submit node failed to start a job correctly, the condor_schedd would claim a slot and then never release it. The condor_schedd has now been altered so that it releases slots on failure to execute the condor_shadow. The slot will now return to Unclaimed/Idle instead of remaining in Claimed/Idle.

Comment 17 errata-xmlrpc 2009-12-03 09:19:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html