Bug 707770

Summary: Dedicated scheduler releases half of claims
Product: Red Hat Enterprise MRG
Component: condor
Version: 1.3
Status: CLOSED ERRATA
Severity: unspecified
Priority: high
Reporter: Erik Erlandson <eerlands>
Assignee: Erik Erlandson <eerlands>
QA Contact: Tomas Rusnak <trusnak>
Docs Contact:
CC: jneedle, matt, mkudlej, trusnak, tstclair
Target Milestone: 2.0.1
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: condor-7.6.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: The logic for deactivating claims in the dedicated scheduler iterated over a claim data structure while that structure was being destructively modified. Consequence: Some of the claims were not properly released. Fix: Logic was corrected to save structure pointers in a temporary container while the claims are deactivated. Result: All claims are now properly deactivated.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-09-07 16:41:37 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 723887

Description Erik Erlandson 2011-05-25 21:30:25 UTC
Description of problem:
If the parallel shadow for a multi-node (i.e. machine_count > 1) process exits abnormally, the dedicated scheduler only releases half of the claims for that process. This tends to result in a job that appears to move back and forth between the idle and running states, with the shadow exiting each time with status 108. The ShadowLog notes that one or more machines refused to start the job:

  Request to run on <XXX.XXX.XXX.XX:56351> <XXX.XXX.XXX.XXX:56351> was REFUSED

The offending StartLog shows:

  Error: can't find resource with ClaimId (<XXX.XXX.XXX.XXX:56351>#1303787222#1#...) -- perhaps this claim was already removed?

For half of the nodes in a process, the dedicated scheduler notes:

  ERROR in releaseClaim(): NULL m_rec

I believe the problem originates in shutdownMPIJob() in dedicated_scheduler.cpp, which attempts to perform a releaseClaim() on all nodes in a process. The current loop doesn't work because the length of the list continues to change due to calls to DelMrec(). This can be solved by releasing the claims in the opposite order.
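To make the failure mode concrete, here is a minimal, self-contained C++ sketch; it is not the actual dedicated_scheduler.cpp code, and the names DedSched and MRec are simplified stand-ins for the real structures. The buggy loop erases the current record on every iteration, so the index skips every other record and roughly half the claims are never released; the fixed variant copies the pointers into a temporary container first, matching the approach described in the doc text.

// Minimal sketch of the failure mode and the fix -- not the actual HTCondor
// source. DedSched and MRec are simplified stand-ins for the real structures.
#include <cstddef>
#include <cstdio>
#include <vector>

struct MRec { int id; };

struct DedSched {
    std::vector<MRec*> matches;   // simplified stand-in for the claim records

    void DelMrec(MRec* m) {       // destructively shrinks 'matches'
        for (std::size_t i = 0; i < matches.size(); ++i) {
            if (matches[i] == m) { matches.erase(matches.begin() + i); break; }
        }
        delete m;
    }

    void releaseClaim(MRec* m) {  // deactivate the claim, then drop its record
        std::printf("releasing claim %d\n", m->id);
        DelMrec(m);
    }

    // Buggy pattern: each release erases the current element, the next record
    // shifts into its slot, and ++i then skips it, so half the claims survive.
    void shutdownBuggy() {
        for (std::size_t i = 0; i < matches.size(); ++i) {
            releaseClaim(matches[i]);
        }
    }

    // Fixed pattern (mirroring the doc text): copy the pointers into a
    // temporary container first, then deactivate each one.
    void shutdownFixed() {
        std::vector<MRec*> tmp(matches);
        for (std::size_t i = 0; i < tmp.size(); ++i) {
            releaseClaim(tmp[i]);
        }
    }
};

int main() {
    DedSched s;
    for (int i = 0; i < 4; ++i) {
        s.matches.push_back(new MRec{i});
    }
    s.shutdownBuggy();   // releases only claims 0 and 2
    s.shutdownFixed();   // releases the claims the buggy loop skipped
    return 0;
}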


How reproducible:


Steps to Reproduce: To do: construct a reproducer. Detection will likely rely on the presence of the shadow exit status 108 and/or the StartLog error messages above.

Actual results:
Parallel jobs fail to release half of their claims.

Expected results:
All claims are properly released.

Additional info:
upstream https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2088

Comment 1 Erik Erlandson 2011-05-31 22:51:10 UTC
Fix is currently upstream on uw/V7_6-branch:
https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=22052

Comment 2 Erik Erlandson 2011-05-31 22:52:41 UTC
Repro/test:

using the following configuration:

# dedicated scheduler
DedicatedScheduler = "DedicatedScheduler"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
RANK = 0 + (Scheduler =?= $(DedicatedScheduler))
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
# slots
NUM_CPUS = 10


using the following submit file:

universe = parallel
cmd = /bin/sleep
args = 300
should_transfer_files = if_needed
when_to_transfer_output = on_exit
machine_count=4
+AccountingGroup = "none.user0000"
queue 1


Before fix: submit the job, and let it begin running. Then invoke a "condor_vacate_job".

You will see null-m_rec errors from the ded-schedd:

$ tail -f SchedLog | grep -e 'ERROR.*NULL m_rec'
05/26/11 21:54:02 (pid:11721) ERROR in releaseClaim(): NULL m_rec
05/26/11 21:54:02 (pid:11721) ERROR in releaseClaim(): NULL m_rec


After the fix, there will be no null m_rec errors.

Comment 3 Erik Erlandson 2011-05-31 22:57:48 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
The logic for deactivating claims in the dedicated scheduler iterated over a claim data structure while that structure was being destructively modified.

Consequence:
Some of the claims were not properly released.

Fix:
Logic was corrected to save structure pointers in a temporary container while the claims are deactivated.

Result:
All claims are now properly deactivated.

Comment 5 Tomas Rusnak 2011-07-22 14:38:36 UTC
When I try to reproduce it, the SchedLog gives me the following:

07/22/11 17:27:18 (pid:24818) Trying to find 4 resource(s) for dedicated job 1.0
07/22/11 17:27:18 (pid:24818) Trying to satisfy job with all possible resources
07/22/11 17:27:18 (pid:24818) Can't satisfy job 1 with all possible resources... trying next job
07/22/11 17:27:18 (pid:24818) In DedicatedScheduler::publishRequestAd()

The job is still in the idle state. I also looked into the original reproducer at the condor wiki, but it is the same.

Could you please check the reproduction method once again?

Comment 6 Erik Erlandson 2011-07-22 19:47:44 UTC
(In reply to comment #5)

> 07/22/11 17:27:18 (pid:24818) Can't satisfy job 1 with all possible
> resources... trying next job

Are there >= 4 slots defined in the pool? If so, I'd be interested in the output of condor_q -better-analyze.

Comment 7 Tomas Rusnak 2011-07-25 08:39:59 UTC
As in the repro comment, I set NUM_CPUS = 10.

# condor_q -better-analyze
-- Submitter: localhost : <IP:45008> : localhost
---
002.000:  Request has not yet been considered by the matchmaker.

SchedLog:

07/25/11 11:22:13 (pid:24818) Starting DedicatedScheduler::handleDedicatedJobs
07/25/11 11:22:14 (pid:24818) Inserting new attribute Scheduler into non-active cluster cid=2 acid=-1
07/25/11 11:22:14 (pid:24818) Found 1 idle dedicated job(s)
07/25/11 11:22:14 (pid:24818) DedicatedScheduler: Listing all dedicated jobs -
07/25/11 11:22:14 (pid:24818) Dedicated job: 2.0 test
07/25/11 11:22:14 (pid:24818) Trying to query collector <IP:9618>
07/25/11 11:22:14 (pid:24818) Found 0 potential dedicated resources
07/25/11 11:22:14 (pid:24818) idle resource list
07/25/11 11:22:14 (pid:24818)  ************ empty ************
07/25/11 11:22:14 (pid:24818) limbo resource list
07/25/11 11:22:14 (pid:24818)  ************ empty ************
07/25/11 11:22:14 (pid:24818) unclaimed resource list
07/25/11 11:22:14 (pid:24818)  ************ empty ************
07/25/11 11:22:14 (pid:24818) busy resource list
07/25/11 11:22:14 (pid:24818)  ************ empty ************
07/25/11 11:22:14 (pid:24818) Trying to find 4 resource(s) for dedicated job 2.0
07/25/11 11:22:14 (pid:24818) Trying to satisfy job with all possible resources
07/25/11 11:22:14 (pid:24818) Can't satisfy job 2 with all possible resources... trying next job
07/25/11 11:22:14 (pid:24818) In DedicatedScheduler::publishRequestAd()

I can't find the reason why the schedd can't find resources for that job.

Comment 9 Tomas Rusnak 2011-07-25 16:47:46 UTC
Thanks Erik. Reproduced on:

$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

tail -f /var/log/condor/SchedLog | grep -e 'ERROR.*NULL m_rec'
07/25/11 19:32:34 (pid:16953) ERROR in releaseClaim(): NULL m_rec
07/25/11 19:32:34 (pid:16953) ERROR in releaseClaim(): NULL m_rec

Comment 10 Tomas Rusnak 2011-07-26 10:30:32 UTC
Retested on all supported platforms (x86 and x86_64 on RHEL 5 and RHEL 6) with:

condor-7.6.3-0.2

# condor_q
-- Submitter: localhost: <IP:41622> : localhost
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 365.0   test            7/26 11:33   0+00:00:25 R  0   0.0  sleep 300         

1 jobs; 0 idle, 1 running, 0 held
# condor_vacate_job 365.0
Job 365.0 vacated
# tail -f /var/log/condor/SchedLog | grep -e 'ERROR.*NULL m_rec'
^C

No null m_rec errors found in the scheduler log.

>>> VERIFIED

Comment 11 errata-xmlrpc 2011-09-07 16:41:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html