Bug 707770
Summary: | Dedicated scheduler releases half of claims | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Erik Erlandson <eerlands> |
Component: | condor | Assignee: | Erik Erlandson <eerlands> |
Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
Severity: | unspecified | Docs Contact: | |
Priority: | high | ||
Version: | 1.3 | CC: | jneedle, matt, mkudlej, trusnak, tstclair |
Target Milestone: | 2.0.1 | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | condor-7.6.2-0.1 | Doc Type: | Bug Fix |
Doc Text: |
Cause:
The logic for deactivating claims in the dedicated scheduler iterated over a claim data structure while that structure was being destructively modified.
Consequence:
Some of the claims were not properly released.
Fix:
The logic was corrected to save the structure pointers in a temporary container while the claims are deactivated (see the sketch after this table).
Result:
All claims are now properly deactivated.
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2011-09-07 16:41:37 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 723887 |
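
The Doc Text above describes the shape of the fix: instead of walking the claim structure while deactivation is removing entries from it, the claim pointers are first saved in a temporary container and deactivation then works from that copy. The sketch below illustrates that general pattern only; the type and function names (ClaimRecord, deactivateClaim, releaseAllClaims) are hypothetical and are not taken from the Condor source.

```cpp
#include <list>
#include <vector>

// Hypothetical claim record; stands in for the scheduler's per-claim data.
struct ClaimRecord {
    int id;
};

// Stand-in for deactivating a single claim.  In the scenario the Doc Text
// describes, releasing a claim also removes its record from the claim
// structure, i.e. the structure is destructively modified as a side effect.
void deactivateClaim(ClaimRecord* rec, std::list<ClaimRecord*>& claims) {
    // ... contact the execute machine, release the claim ...
    claims.remove(rec);                 // destructive modification
}

// Corrected shape described in the Doc Text: save the structure pointers in
// a temporary container first, then deactivate from that copy.  The broken
// version iterated over 'claims' directly, so the removal above disturbed
// the traversal and some claims were never released.
void releaseAllClaims(std::list<ClaimRecord*>& claims) {
    std::vector<ClaimRecord*> to_release(claims.begin(), claims.end());
    for (ClaimRecord* rec : to_release) {
        deactivateClaim(rec, claims);   // safe: we iterate over the copy
    }
}
```

A toy demonstration of the original failure mode, showing why roughly half of the claims end up skipped, follows the Description below.
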
Description
Erik Erlandson
2011-05-25 21:30:25 UTC
Fix is currently upstream on uw/V7_6-branch: https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=22052

Repro/test: using the following configuration:

```
# dedicated scheduler
DedicatedScheduler = "DedicatedScheduler"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
RANK = 0 + (Scheduler =?= $(DedicatedScheduler))
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
# slots
NUM_CPUS = 10
```

using the following submit file:

```
universe = parallel
cmd = /bin/sleep
args = 300
should_transfer_files = if_needed
when_to_transfer_output = on_exit
machine_count=4
+AccountingGroup = "none.user0000"
queue 1
```

Before fix: submit the job, and let it begin running. Then invoke a "condor_vacate_job". You will see null-m_rec errors from the ded-schedd:

```
$ tail -f SchedLog | grep -e 'ERROR.*NULL m_rec'
05/26/11 21:54:02 (pid:11721) ERROR in releaseClaim(): NULL m_rec
05/26/11 21:54:02 (pid:11721) ERROR in releaseClaim(): NULL m_rec
```

After the fix, there will be no null m_rec errors.
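
The bug summary says the dedicated scheduler releases only half of the claims, and the repro above ends with NULL m_rec errors after a condor_vacate_job of a four-slot parallel job. The toy program below is not Condor code (the container choice, the index-based loop, and the slot names are assumptions made for illustration); it only shows how removing entries from a structure while iterating over it can release every other entry and skip the rest.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Toy illustration only: 'claims' stands in for the scheduler's claim
// structure, and "releasing" an entry removes it from that structure while
// the same loop is still walking it.
int main() {
    std::vector<std::string> claims = {"slot1", "slot2", "slot3", "slot4"};

    // Destructive modification during iteration: erasing index i shifts the
    // next entry down into position i, and ++i then skips it.
    for (std::size_t i = 0; i < claims.size(); ++i) {
        std::cout << "releasing claim on " << claims[i] << "\n";
        claims.erase(claims.begin() + i);
    }

    std::cout << claims.size() << " claim(s) were never released:\n";
    for (const std::string& c : claims) {
        std::cout << "  " << c << "\n";
    }
    return 0;
}
```

Built with, for example, g++ -std=c++11, this releases slot1 and slot3 and never touches slot2 and slot4, i.e. exactly half of the entries are skipped.
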

---

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:

Cause:
The logic for deactivating claims in the dedicated scheduler iterated over a claim data structure while that structure was being destructively modified.
Consequence:
Some of the claims were not properly released.
Fix:
The logic was corrected to save the structure pointers in a temporary container while the claims are deactivated.
Result:
All claims are now properly deactivated.

---

When I try to reproduce it, the SchedLog gives me the following:

```
07/22/11 17:27:18 (pid:24818) Trying to find 4 resource(s) for dedicated job 1.0
07/22/11 17:27:18 (pid:24818) Trying to satisfy job with all possible resources
07/22/11 17:27:18 (pid:24818) Can't satisfy job 1 with all possible resources... trying next job
07/22/11 17:27:18 (pid:24818) In DedicatedScheduler::publishRequestAd()
```

The job is still in the idle state. I also looked at the original reproducer on the Condor wiki, but it is the same. Please, could you check the reproducing method once again?

---

(In reply to comment #5)
> 07/22/11 17:27:18 (pid:24818) Can't satisfy job 1 with all possible
> resources... trying next job

Are there >= 4 slots defined on the pool? If so, I'd be interested in the output of condor_q -better-analyze.

---

As in the repro comment, I set NUM_CPUS = 10.

```
# condor_q -better-analyze
-- Submitter: localhost : <IP:45008> : localhost
---
002.000: Request has not yet been considered by the matchmaker.
```

SchedLog:

```
07/25/11 11:22:13 (pid:24818) Starting DedicatedScheduler::handleDedicatedJobs
07/25/11 11:22:14 (pid:24818) Inserting new attribute Scheduler into non-active cluster cid=2 acid=-1
07/25/11 11:22:14 (pid:24818) Found 1 idle dedicated job(s)
07/25/11 11:22:14 (pid:24818) DedicatedScheduler: Listing all dedicated jobs -
07/25/11 11:22:14 (pid:24818) Dedicated job: 2.0 test
07/25/11 11:22:14 (pid:24818) Trying to query collector <IP:9618>
07/25/11 11:22:14 (pid:24818) Found 0 potential dedicated resources
07/25/11 11:22:14 (pid:24818) idle resource list
07/25/11 11:22:14 (pid:24818) ************ empty ************
07/25/11 11:22:14 (pid:24818) limbo resource list
07/25/11 11:22:14 (pid:24818) ************ empty ************
07/25/11 11:22:14 (pid:24818) unclaimed resource list
07/25/11 11:22:14 (pid:24818) ************ empty ************
07/25/11 11:22:14 (pid:24818) busy resource list
07/25/11 11:22:14 (pid:24818) ************ empty ************
07/25/11 11:22:14 (pid:24818) Trying to find 4 resource(s) for dedicated job 2.0
07/25/11 11:22:14 (pid:24818) Trying to satisfy job with all possible resources
07/25/11 11:22:14 (pid:24818) Can't satisfy job 2 with all possible resources... trying next job
07/25/11 11:22:14 (pid:24818) In DedicatedScheduler::publishRequestAd()
```

I can't find a reason why the schedd can't find resources for that job.

---

Thanks Erik. Reproduced on:

```
$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $
```

```
tail -f /var/log/condor/SchedLog | grep -e 'ERROR.*NULL m_rec'
07/25/11 19:32:34 (pid:16953) ERROR in releaseClaim(): NULL m_rec
07/25/11 19:32:34 (pid:16953) ERROR in releaseClaim(): NULL m_rec
```

---

Retested over all supported platforms x86,x86_64/RHEL5,RHEL6 with:
condor-7.6.3-0.2
```
# condor_q
-- Submitter: localhost: <IP:41622> : localhost
 ID      OWNER        SUBMITTED     RUN_TIME ST PRI SIZE CMD
 365.0   test        7/26 11:33   0+00:00:25 R  0   0.0  sleep 300

1 jobs; 0 idle, 1 running, 0 held

# condor_vacate_job 365.0
Job 365.0 vacated

# tail -f /var/log/condor/SchedLog | grep -e 'ERROR.*NULL m_rec'
^C
```
No NULL m_rec errors were found in the scheduler log.
>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1249.html