Description of problem:
If the parallel shadow for a multi-node (i.e. machine_count > 1) process exits abnormally, the dedicated scheduler only releases half of the claims for that process. This tends to result in a job that appears to move back and forth between the idle and running states, with the shadow exiting each time with status 108.

The ShadowLog notes that a particular machine (or machines) refused to start the job:

  Request to run on <XXX.XXX.XXX.XX:56351> <XXX.XXX.XXX.XXX:56351> was REFUSED

The offending StartLog shows:

  Error: can't find resource with ClaimId (<XXX.XXX.XXX.XXX:56351>#1303787222#1#...) -- perhaps this claim was already removed?

For half of the nodes in a process, the dedicated scheduler notes:

  ERROR in releaseClaim(): NULL m_rec

I believe the problem originates in shutdownMPIJob() in dedicated_scheduler.cpp, which attempts to perform a releaseClaim() on all nodes in a process. The current loop doesn't work because the length of the list keeps changing due to calls to DelMrec(). This can be solved by releasing the claims in the opposite order.

How reproducible:
Steps to Reproduce:
To do: construct a repro. Likely detect the presence of the 108 shadow exit and/or the StartLog error messages.

Actual results:
Parallel jobs fail to release half of their claims.

Expected results:
All claims are properly released.

Additional info:
Upstream ticket: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2088
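For illustration, here is a minimal, self-contained C++ sketch of the iteration hazard described above. The MatchRec struct, releaseClaimAt() helper, and the container used here are simplified placeholders, not the actual Condor data structures; the point is only that advancing an index while a DelMrec()-style removal shrinks the list skips every other entry, so half the claims are never released.

  #include <cstdio>
  #include <vector>

  // Simplified stand-in for the per-node match records of a parallel job.
  struct MatchRec { int node; };

  // Stand-in for the release path: deactivating a claim also deletes its
  // record from the job's list (as DelMrec() does), shrinking the container
  // while the caller is still iterating over it.
  static void releaseClaimAt(std::vector<MatchRec>& matches, size_t i) {
      std::printf("released claim for node %d\n", matches[i].node);
      matches.erase(matches.begin() + i);
  }

  int main() {
      std::vector<MatchRec> matches = { {0}, {1}, {2}, {3} };  // machine_count = 4

      // Buggy pattern: the index keeps advancing while the list keeps
      // shrinking, so every erase makes the loop skip the element that
      // slid into position i -- only half the claims get released.
      for (size_t i = 0; i < matches.size(); ++i) {
          releaseClaimAt(matches, i);
      }

      // Prints "2 claim(s) never released" for a 4-node job.
      std::printf("%zu claim(s) never released\n", matches.size());
      return 0;
  }

Releasing in reverse order (as suggested above), or copying the records into a temporary container before iterating (the approach described in the technical note further down), both avoid the skipping, because the removals no longer disturb the entries that have not yet been visited.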
Fix is currently upstream on uw/V7_6-branch: https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=22052
Repro/test: using the following configuration:

  # dedicated scheduler
  DedicatedScheduler = "DedicatedScheduler"
  STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
  RANK = 0 + (Scheduler =?= $(DedicatedScheduler))
  MPI_CONDOR_RSH_PATH = $(LIBEXEC)
  CONDOR_SSHD = /usr/sbin/sshd
  CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
  # slots
  NUM_CPUS = 10

and the following submit file:

  universe = parallel
  cmd = /bin/sleep
  args = 300
  should_transfer_files = if_needed
  when_to_transfer_output = on_exit
  machine_count = 4
  +AccountingGroup = "none.user0000"
  queue 1

Before the fix: submit the job and let it begin running, then invoke a condor_vacate_job. You will see NULL m_rec errors from the ded-schedd:

  $ tail -f SchedLog | grep -e 'ERROR.*NULL m_rec'
  05/26/11 21:54:02 (pid:11721) ERROR in releaseClaim(): NULL m_rec
  05/26/11 21:54:02 (pid:11721) ERROR in releaseClaim(): NULL m_rec

After the fix, there will be no NULL m_rec errors.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: The logic for deactivating claims in the dedicated scheduler iterated over a claim data structure while that structure was being destructively modified.
Consequence: Some of the claims were not properly released.
Fix: The logic was corrected to save the structure pointers in a temporary container while the claims are deactivated.
Result: All claims are now properly deactivated.
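As a rough sketch of the corrected pattern the note describes (the types and names below are simplified placeholders, not the actual scheduler code or the committed change): the match records are copied into a temporary container up front, so the destructive removal performed while each claim is released can no longer disturb the iteration.

  #include <cstdio>
  #include <algorithm>
  #include <vector>

  // Simplified placeholder for a per-node match record.
  struct MatchRec { int node; };

  // Releasing a claim still removes the record from the live list,
  // mimicking the destructive DelMrec()-style modification.
  static void releaseClaim(std::vector<MatchRec>& live, int node) {
      std::printf("released claim for node %d\n", node);
      live.erase(std::remove_if(live.begin(), live.end(),
                                [node](const MatchRec& r) { return r.node == node; }),
                 live.end());
  }

  int main() {
      std::vector<MatchRec> matches = { {0}, {1}, {2}, {3} };  // machine_count = 4

      // Corrected pattern: snapshot the records into a temporary container
      // first, then iterate over the snapshot while the live list shrinks.
      const std::vector<MatchRec> toRelease(matches);
      for (const MatchRec& m : toRelease) {
          releaseClaim(matches, m.node);
      }

      // Prints "0 claim(s) left unreleased" -- every node is released.
      std::printf("%zu claim(s) left unreleased\n", matches.size());
      return 0;
  }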
When I try to reproduce it, the SchedLog gives me the following:

  07/22/11 17:27:18 (pid:24818) Trying to find 4 resource(s) for dedicated job 1.0
  07/22/11 17:27:18 (pid:24818) Trying to satisfy job with all possible resources
  07/22/11 17:27:18 (pid:24818) Can't satisfy job 1 with all possible resources... trying next job
  07/22/11 17:27:18 (pid:24818) In DedicatedScheduler::publishRequestAd()

The job is still in the idle state. I also looked at the original reproducer on the Condor wiki, but it is the same. Could you please check the reproduction method once again?
(In reply to comment #5)
> 07/22/11 17:27:18 (pid:24818) Can't satisfy job 1 with all possible
> resources... trying next job

Are there >= 4 slots defined in the pool? If so, I'd be interested in the output of condor_q -better-analyze.
As in the repro comment, I set NUM_CPUS = 10.

  # condor_q -better-analyze
  -- Submitter: localhost : <IP:45008> : localhost
  ---
  002.000: Request has not yet been considered by the matchmaker.

SchedLog:

  07/25/11 11:22:13 (pid:24818) Starting DedicatedScheduler::handleDedicatedJobs
  07/25/11 11:22:14 (pid:24818) Inserting new attribute Scheduler into non-active cluster cid=2 acid=-1
  07/25/11 11:22:14 (pid:24818) Found 1 idle dedicated job(s)
  07/25/11 11:22:14 (pid:24818) DedicatedScheduler: Listing all dedicated jobs -
  07/25/11 11:22:14 (pid:24818) Dedicated job: 2.0 test
  07/25/11 11:22:14 (pid:24818) Trying to query collector <IP:9618>
  07/25/11 11:22:14 (pid:24818) Found 0 potential dedicated resources
  07/25/11 11:22:14 (pid:24818) idle resource list
  07/25/11 11:22:14 (pid:24818) ************ empty ************
  07/25/11 11:22:14 (pid:24818) limbo resource list
  07/25/11 11:22:14 (pid:24818) ************ empty ************
  07/25/11 11:22:14 (pid:24818) unclaimed resource list
  07/25/11 11:22:14 (pid:24818) ************ empty ************
  07/25/11 11:22:14 (pid:24818) busy resource list
  07/25/11 11:22:14 (pid:24818) ************ empty ************
  07/25/11 11:22:14 (pid:24818) Trying to find 4 resource(s) for dedicated job 2.0
  07/25/11 11:22:14 (pid:24818) Trying to satisfy job with all possible resources
  07/25/11 11:22:14 (pid:24818) Can't satisfy job 2 with all possible resources... trying next job
  07/25/11 11:22:14 (pid:24818) In DedicatedScheduler::publishRequestAd()

I can't find the reason why the schedd can't find resources for that job.
Thanks Erik. Reproduced on:

  $CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
  $CondorPlatform: X86_64-RedHat_5.6 $

  tail -f /var/log/condor/SchedLog | grep -e 'ERROR.*NULL m_rec'
  07/25/11 19:32:34 (pid:16953) ERROR in releaseClaim(): NULL m_rec
  07/25/11 19:32:34 (pid:16953) ERROR in releaseClaim(): NULL m_rec
Retested over all supported platforms (x86, x86_64 / RHEL5, RHEL6) with condor-7.6.3-0.2:

  # condor_q
  -- Submitter: localhost: <IP:41622> : localhost
   ID      OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD
  365.0   test       7/26 11:33   0+00:00:25 R  0   0.0  sleep 300
  1 jobs; 0 idle, 1 running, 0 held

  # condor_vacate_job 365.0
  Job 365.0 vacated

  # tail -f /var/log/condor/SchedLog | grep -e 'ERROR.*NULL m_rec'
  ^C

No NULL m_rec errors found in the scheduler log.

>>> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html