Bug 523482 - Slots left Claimed Idle when there are no jobs
Summary: Slots left Claimed Idle when there are no jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.1
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: 1.2
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: Jan Sarenik
URL:
Whiteboard:
Depends On:
Blocks: 527551
Reported: 2009-09-15 16:46 UTC by Matthew Farrellee
Modified: 2009-12-03 09:19 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: A failure on a submit node when starting a job
C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
F: The condor_schedd now releases slots on failure to execute the condor_shadow.
R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.
If a submit node failed to start a job correctly, the condor_schedd would claim a slot and then never release it. The condor_schedd has now been altered so that it releases slots on failure to execute the condor_shadow. The slot will now return to Unclaimed/Idle instead of remaining in Claimed/Idle.
Clone Of:
Environment:
Last Closed: 2009-12-03 09:19:37 UTC
Target Upstream Version:
Embargoed:


Attachments
Simple verification script (1022 bytes, application/x-sh)
2009-11-03 17:25 UTC, Jan Sarenik


Links
System: Red Hat Product Errata
ID: RHEA-2009:1633
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging and Grid Version 1.2
Last Updated: 2009-12-03 09:15:33 UTC

Description Matthew Farrellee 2009-09-15 16:46:17 UTC
Description of problem:

It is possible for slots to be left in the Claimed state, with no running jobs and no jobs to run, until the claim lease expires.


Version-Release number of selected component (if applicable):

condor-7.4.0-0.4


How reproducible:

100%


Steps to Reproduce:
1. Start condor (make sure the schedd starts)
2. mv /usr/sbin/condor_shadow /usr/sbin/condor_shadow-
3. Submit a job (echo -e 'executable=/bin/sleep\narguments=5\ntransfer_executable=false\nnotification=never\nqueue 1\n' | condor_submit)
4. Watch condor_status -total until there is a Claimed slot; condor_status | grep Claimed can also be used, and the slot should show as Claimed Idle
5. Remove the job (condor_rm -a will do it)

  
Actual results:

The Claimed slot will remain that way until the claim lease expires. This can be verified by watching the StartLog on the slot's machine.


Expected results:

The slot should be returned to the Unclaimed state.


Additional info:

Running a job on a slot is a multi-step process. Three steps the Schedd goes through are: 1) REQUEST_CLAIM on the Startd, 2) start a Shadow for the job, and 3) let the Shadow ACTIVATE_CLAIM on the Startd. The Schedd will RELEASE_CLAIM on the Startd for all jobs that have an active Shadow, so a claim whose Shadow never started is not released. Anything that prevents step (2) from succeeding will demonstrate this bug: for example, the Shadow cannot be executed (induced above), or a condor_rm -a is issued while the Schedd is loaded and Shadow execution is cancelled (the original way of hitting this bug).

The Schedd keeps a match_rec for each job it is starting. The match_rec has a status field that can indicate M_STARTD_CONTACT_LIMBO (step 1 completed) or M_CLAIMED (step 2 completed). The Schedd sends RELEASE_CLAIM on M_CLAIMED and should be changed to send it on M_STARTD_CONTACT_LIMBO as well.
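
The proposed change can be illustrated with a minimal sketch. This is a hypothetical Python model, not the Schedd's actual code: the two status names are taken from the paragraph above, while MatchStatus, should_release_claim, and the fixed flag are names invented here for illustration, and any other states the real match_rec may have are omitted.

```python
from enum import Enum, auto


class MatchStatus(Enum):
    # Status names taken from the description above; any other states of
    # the real match_rec are deliberately left out of this sketch.
    M_STARTD_CONTACT_LIMBO = auto()  # step 1 (REQUEST_CLAIM) completed
    M_CLAIMED = auto()               # step 2 (Shadow started) completed


def should_release_claim(status: MatchStatus, fixed: bool) -> bool:
    """Return True if this model of the Schedd would send RELEASE_CLAIM.

    fixed=False models the behaviour reported in this bug: only matches
    in M_CLAIMED are released. fixed=True models the proposed change:
    matches still in M_STARTD_CONTACT_LIMBO are released as well.
    """
    releasable = {MatchStatus.M_CLAIMED}
    if fixed:
        releasable.add(MatchStatus.M_STARTD_CONTACT_LIMBO)
    return status in releasable


if __name__ == "__main__":
    # The Shadow could not be executed, so the match never left limbo.
    stuck = MatchStatus.M_STARTD_CONTACT_LIMBO
    print(should_release_claim(stuck, fixed=False))  # False
    print(should_release_claim(stuck, fixed=True))   # True
```

In this model, a match stuck in M_STARTD_CONTACT_LIMBO is never released under the old behaviour, which corresponds to the slot staying Claimed/Idle until the claim lease expires.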

Comment 2 Matthew Farrellee 2009-09-15 17:51:48 UTC
Resolved upstream in V7_4-branch, will be in 7.4.0-0.5

http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=742

Comment 4 Jan Sarenik 2009-10-29 10:14:46 UTC
On condor-7.4.1-0.2.el5 the slot is still not returned to the Unclaimed state,
the same as on condor-7.4.0-0.4.el5.

Comment 5 Irina Boverman 2009-10-29 14:30:01 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
please see bug summary.

Comment 6 Matthew Farrellee 2009-10-30 12:01:17 UTC
Deleted Release Notes Contents.

Old Contents:
please see bug summary.

Comment 7 Matthew Farrellee 2009-11-03 16:18:03 UTC
After some further investigation, the amount of time until the Schedd sends a RELEASE_CLAIM is controlled by SCHEDD_INTERVAL (defaults to 300 seconds). I believe this is working as expected.

Comment 8 Jan Sarenik 2009-11-03 17:25:22 UTC
Created attachment 367329 [details]
Simple verification script

Comment 9 Jan Sarenik 2009-11-03 17:32:08 UTC
condor-7.4.1-0.2.el5
condor-7.4.1-0.2.el4

VERIFIED

Comment 10 Irina Boverman 2009-11-09 19:19:05 UTC
Release note added.

New Contents:
see summary

Comment 11 Lana Brindley 2009-11-11 20:33:40 UTC
(In reply to comment #7)
> After some further investigation, the amount of time until the Schedd sends a
> RELEASE_CLAIM is controlled by SCHEDD_INTERVAL (defaults to 300 seconds). I
> believe this is working as expected.  

So ... no release note required?

LKB

Comment 12 Matthew Farrellee 2009-11-11 21:30:02 UTC
Still need relnote...

C: A failure on a submit node when starting a job can result in a slot remaining in the Claimed/Idle state/activity without a job until the claim lease expires.
C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
F: The condor_schedd now releases slots on failure to execute the condor_shadow.
R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.

Comment 13 Jan Sarenik 2009-11-12 09:59:56 UTC
(In reply to comment #11)
> So ... no release note required?

Lana, as far as I understand, that was a reply to my confusion
that the slots were not freed as quickly as I expected.
I then used the mentioned variable to see them being
freed more quickly, which did not happen on the buggy
version.

Comment 14 Jan Sarenik 2009-11-12 10:00:40 UTC
(In reply to comment #13)
> Lana, as far as I understand, that was a reply to my confusion

_that_ stands for comment #7 of course.

Comment 15 Lana Brindley 2009-11-19 00:40:46 UTC
Release note updated.

Diffed Contents:
@@ -1 +1,8 @@
-see summary
+Grid bug fix
+
+C: A failure on a submit node when starting a job
+C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
+F: The condor_schedd now releases slots on failure to execute the condor_shadow.
+R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle. 
+
+If a submit node failed to start a job correctly, the condor_schedd would claim a slot and then never release it. The condor_schedd has now been altered so that it releases slots on failure to execute the condor_shadow. The slot will now return to Unclaimed/Idle instead of remaining in Claimed/Idle.

Comment 17 errata-xmlrpc 2009-12-03 09:19:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html

