Description of problem:
It is possible for slots to be left in the Claimed state with no running jobs, and no jobs to run, until the claim lease expires.

Version-Release number of selected component (if applicable):
condor-7.4.0-0.4

How reproducible:
100%

Steps to Reproduce:
1. Start condor (make sure the schedd starts)
2. mv /usr/sbin/condor_shadow /usr/sbin/condor_shadow-
3. Submit a job (echo -e 'executable=/bin/sleep\narguments=5\ntransfer_executable=false\nnotification=never\nqueue 1\n' | condor_submit)
4. Watch condor_status -total until there is a Claimed slot; you can also check condor_status | grep Claimed, which should show Claimed Idle
5. Remove the job (condor_rm -a will do it)

Actual results:
The Claimed slot will remain that way until the claim lease expires. This can be verified by watching the StartLog on the slot's machine.

Expected results:
The slot should be returned to the Unclaimed state.

Additional info:
Running a job on a slot is a multi-step process. Three steps the Schedd goes through are to 1) REQUEST_CLAIM on the Startd, 2) start a Shadow for the job, and 3) let the Shadow ACTIVATE_CLAIM on the Startd. The Schedd will RELEASE_CLAIM on the Startd for all jobs that have an active Shadow. Anything that prevents step (2) from succeeding will demonstrate this bug, such as the Shadow failing to execute (induced above) or a condor_rm -a issued while the Schedd is loaded (the original way of hitting this bug), which cancels Shadow execution.

The Schedd keeps a match_rec for each job it is starting. The match_rec has a status field that can indicate M_STARTD_CONTACT_LIMBO (step 1 completed) or M_CLAIMED (step 2 completed). The Schedd sends RELEASE_CLAIM on M_CLAIMED and should be changed to send it on M_STARTD_CONTACT_LIMBO as well.
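The release decision described above can be sketched roughly as follows. This is an illustrative Python model only, not the actual Schedd code (which is C++): the two status names come from this report, while the MatchStatus enum and both function names are hypothetical.

```python
from enum import Enum


class MatchStatus(Enum):
    # Only the two states named in this report are modelled here.
    M_STARTD_CONTACT_LIMBO = "step 1 completed"
    M_CLAIMED = "step 2 completed"


def releases_claim_before_fix(status):
    """Reported behaviour: RELEASE_CLAIM is sent only for M_CLAIMED."""
    return status is MatchStatus.M_CLAIMED


def releases_claim_after_fix(status):
    """Proposed behaviour: also release when step 2 never completed."""
    return status in (MatchStatus.M_CLAIMED,
                      MatchStatus.M_STARTD_CONTACT_LIMBO)


# A match stuck after step 1 (the Shadow never started):
stuck = MatchStatus.M_STARTD_CONTACT_LIMBO
print(releases_claim_before_fix(stuck))  # False: slot stays Claimed/Idle
print(releases_claim_after_fix(stuck))   # True: slot can return to Unclaimed
```

Under this model, a match left in M_STARTD_CONTACT_LIMBO is the only case whose outcome changes; a match in M_CLAIMED is released either way.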
Resolved upstream in V7_4-branch, will be in 7.4.0-0.5 http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=742
On condor-7.4.1-0.2.el5, the slot is still not returned to the Unclaimed state, the same as on condor-7.4.0-0.4.el5.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: please see bug summary.
Deleted Release Notes Contents. Old Contents: please see bug summary.
After some further investigation, the amount of time until the Schedd sends a RELEASE_CLAIM is controlled by SCHEDD_INTERVAL (defaults to 300 seconds). I believe this is working as expected.
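To see how long a slot stays Claimed/Idle relative to SCHEDD_INTERVAL, one option is to sample the condor_status output periodically and filter it. The sketch below is not the attached verification script. It is a hypothetical helper (claimed_idle_slots is an invented name) that assumes the default condor_status layout, with the slot name in the first column and the State and Activity columns adjacent.

```python
def claimed_idle_slots(status_text):
    """Return names of slots whose State/Activity is Claimed/Idle.

    Assumes the slot name is the first whitespace-separated field and
    that State and Activity appear as adjacent fields on the same line.
    """
    slots = []
    for line in status_text.splitlines():
        fields = line.split()
        # Start at index 1 so the first field is always treated as the name.
        for i in range(1, len(fields) - 1):
            if fields[i] == "Claimed" and fields[i + 1] == "Idle":
                slots.append(fields[0])
                break
    return slots


# Made-up sample output for illustration; host names are placeholders.
sample = """\
Name               OpSys Arch   State     Activity LoadAv Mem  ActvtyTime

slot1@host.example LINUX X86_64 Claimed   Idle     0.000  1024 0+00:00:04
slot2@host.example LINUX X86_64 Unclaimed Idle     0.000  1024 0+00:10:00
"""
print(claimed_idle_slots(sample))
```

Feeding it the text captured from condor_status at regular intervals, and noting when the returned list becomes empty, gives a rough measure of how long the claim was held.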
Created attachment 367329: Simple verification script
condor-7.4.1-0.2.el5 condor-7.4.1-0.2.el4 VERIFIED
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: see summary
(In reply to comment #7) > After some further investigation, the amount of time until the Schedd sends a > RELEASE_CLAIM is controlled by SCHEDD_INTERVAL (defaults to 300 seconds). I > believe this is working as expected. So ... no release note required? LKB
Still need relnote... C: A failure on a submit node when starting a job can result in a slot remaining in the Claimed/Idle state/activity without a job until the claim lease expires. C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow. F: The condor_schedd now releases slots on failure to execute the condor_shadow. R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.
(In reply to comment #11) > So ... no release note required? Lana, as far as I understand, that was a reply to my confusion that the slots did not get freed as quickly as I expected. I then used the mentioned variable to see them being freed more quickly, which did not happen on the buggy version.
(In reply to comment #13) > Lana, as far as I understand that was reply to my confusion _that_ stands for comment #7 of course.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-see summary
+Grid bug fix
+
+C: A failure a submit node when starting a job
+C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
+F: The condor_schedd now releases slots on failure to execute the condor_shadow.
+R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.
+
+If a submit node failed to start a job correctly, the condor_schedd would claim a slot and then never release it. The condor_schedd has now been altered so that it releases slots on failure to execute the condor_shadow. The slot will now return to Unclaimed/Idle instead of remaining in Claimed/Idle.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-1633.html