Bug 523482
Summary: | Slots left Claimed Idle when there are no jobs | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> | ||||
Component: | condor | Assignee: | Matthew Farrellee <matt> | ||||
Status: | CLOSED ERRATA | QA Contact: | Jan Sarenik <jsarenik> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | low | ||||||
Version: | 1.1 | CC: | iboverma, jsarenik, lbrindle | ||||
Target Milestone: | 1.2 | ||||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: |
Grid bug fix
C: A failure on a submit node when starting a job
C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
F: The condor_schedd now releases slots on failure to execute the condor_shadow.
R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.
If a submit node failed to start a job correctly, the condor_schedd would claim a slot and then never release it. The condor_schedd has now been altered so that it releases slots on failure to execute the condor_shadow. The slot will now return to Unclaimed/Idle instead of remaining in Claimed/Idle.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2009-12-03 09:19:37 UTC | Type: | --- | ||||
Bug Depends On: | |||||||
Bug Blocks: | 527551 | ||||||
Attachments: |
|
Description
Matthew Farrellee
2009-09-15 16:46:17 UTC
Resolved upstream in V7_4-branch, will be in 7.4.0-0.5
http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=742

On condor-7.4.1-0.2.el5, the slot is not returned to the Unclaimed state, the same way as on condor-7.4.0-0.4.el5.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents: please see bug summary.

Deleted Release Notes Contents.
Old Contents: please see bug summary.

After some further investigation, the amount of time until the Schedd sends a RELEASE_CLAIM is controlled by SCHEDD_INTERVAL (defaults to 300 seconds). I believe this is working as expected.

Created attachment 367329: Simple verification script

condor-7.4.1-0.2.el5
condor-7.4.1-0.2.el4
VERIFIED

Release note added.
New Contents: see summary

(In reply to comment #7)
> After some further investigation, the amount of time until the Schedd sends a
> RELEASE_CLAIM is controlled by SCHEDD_INTERVAL (defaults to 300 seconds). I
> believe this is working as expected.
So ... no release note required?
LKB

Still need relnote...
C: A failure on a submit node when starting a job can result in a slot remaining in the Claimed/Idle state/activity without a job until the claim lease expires.
C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
F: The condor_schedd now releases slots on failure to execute the condor_shadow.
R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.

(In reply to comment #11)
> So ... no release note required?
Lana, as far as I understand, that was a reply to my confusion that the slots did not get freed as quickly as I expected. Then I used the mentioned variable to see them being freed more quickly, which did not happen on the buggy version.

(In reply to comment #13)
> Lana, as far as I understand, that was a reply to my confusion
_that_ stands for comment #7 of course.

Release note updated.
Diffed Contents:
@@ -1 +1,8 @@
-see summary
+Grid bug fix
+
+C: A failure on a submit node when starting a job
+C: The condor_schedd would claim a slot but not release it on failure to execute the condor_shadow.
+F: The condor_schedd now releases slots on failure to execute the condor_shadow.
+R: The slot will now return to Unclaimed/Idle instead of remaining Claimed/Idle.
+
+If a submit node failed to start a job correctly, the condor_schedd would claim a slot and then never release it. The condor_schedd has now been altered so that it releases slots on failure to execute the condor_shadow. The slot will now return to Unclaimed/Idle instead of remaining in Claimed/Idle.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHEA-2009-1633.html
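
For anyone re-checking this behaviour, the following is a minimal sketch of how slots left in Claimed/Idle might be spotted. It is not the attached verification script, whose contents are not reproduced in this report. It assumes the slot list is available as plain text with one "name state activity" triple per line, for example from something like condor_status -format "%s " Name -format "%s " State -format "%s\n" Activity (the exact invocation is an assumption). The names Slot, parse_slots and stuck_slots, and the sample host names, are hypothetical.

```python
from collections import namedtuple

# Hypothetical record for one line of slot status output.
Slot = namedtuple("Slot", ["name", "state", "activity"])


def parse_slots(text):
    """Parse lines of the form '<name> <state> <activity>' into Slot tuples."""
    slots = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        slots.append(Slot(*fields))
    return slots


def stuck_slots(slots):
    """Return the slots currently reported as Claimed/Idle."""
    return [s for s in slots if s.state == "Claimed" and s.activity == "Idle"]


if __name__ == "__main__":
    # Sample input with made-up slot names, standing in for real output.
    sample = (
        "slot1@node1.example.com Claimed Idle\n"
        "slot2@node1.example.com Unclaimed Idle\n"
        "slot3@node1.example.com Claimed Busy\n"
    )
    for slot in stuck_slots(parse_slots(sample)):
        print(slot.name)
```

A slot can legitimately be Claimed/Idle for a short time, so a hit here is only a candidate. As noted above, the Schedd sends RELEASE_CLAIM on a schedule controlled by SCHEDD_INTERVAL (defaults to 300 seconds), so the check is only meaningful once at least that interval has passed with no jobs in the queue.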