Bug 707770 - Dedicated scheduler releases half of claims
Summary: Dedicated scheduler releases half of claims
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: 2.0.1
Assignee: Erik Erlandson
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks: 723887
 
Reported: 2011-05-25 21:30 UTC by Erik Erlandson
Modified: 2011-09-07 16:41 UTC
CC: 5 users

Fixed In Version: condor-7.6.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: The logic for deactivating claims in the dedicated scheduler iterated over a claim data structure while that structure was being destructively modified. Consequence: Some of the claims were not properly released. Fix: The logic was corrected to save the structure pointers in a temporary container while the claims are deactivated. Result: All claims are now properly deactivated.
Clone Of:
Environment:
Last Closed: 2011-09-07 16:41:37 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHSA-2011:1249
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update
Last Updated: 2011-09-07 16:40:45 UTC

Description Erik Erlandson 2011-05-25 21:30:25 UTC
Description of problem:
If the parallel shadow for a multi-node (i.e. machine_count > 1) process exits abnormally, the dedicated scheduler only releases half of the claims for that process. This tends to result in a job that appears to move back and forth between the idle and running states, with the shadow exiting each time with status 108. The ShadowLog notes that one or more machines refused to start the job:

  Request to run on <XXX.XXX.XXX.XX:56351> <XXX.XXX.XXX.XXX:56351> was REFUSED

The offending StartLog shows:

  Error: can't find resource with ClaimId (<XXX.XXX.XXX.XXX:56351>#1303787222#1#...) -- perhaps this claim was already removed?

For half of the nodes in a process, the dedicated scheduler notes:

  ERROR in releaseClaim(): NULL m_rec

I believe the problem originates in shutdownMPIJob() in dedicated_scheduler.cpp, which attempts to perform a releaseClaim() on all nodes in a process. The current loop doesn't work because the length of the list continues to change due to calls to DelMrec(). This can be solved by releasing the claims in the opposite order.
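
To make the failure mode concrete, here is a minimal C++ sketch. It is not the actual dedicated_scheduler.cpp code: MatchRec, matches, releaseAndDelete(), shutdownBuggy(), and shutdownFixed() are hypothetical stand-ins for the scheduler's match records, releaseClaim()/DelMrec(), and shutdownMPIJob(). The point is that walking a container which shrinks during the walk skips roughly every other entry, while staging the pointers in a temporary container first (the approach described in the eventual fix) deactivates every claim.

// Minimal sketch with hypothetical names; not the real dedicated_scheduler.cpp code.
// releaseAndDelete() stands in for releaseClaim() + DelMrec(): it deactivates a
// claim and removes its record from the shared container.
#include <cstddef>
#include <cstdio>
#include <iterator>
#include <list>
#include <vector>

struct MatchRec { int id; };

std::list<MatchRec*> matches;   // stand-in for the scheduler's claim records

void releaseAndDelete(MatchRec* m) {
    std::printf("released claim %d\n", m->id);
    matches.remove(m);          // destructive modification of the container
    delete m;
}

// Buggy pattern: an index-based walk over a container that shrinks as we go.
// Each removal shifts the remaining elements, so the walk skips every other one.
void shutdownBuggy() {
    for (std::size_t i = 0; i < matches.size(); ++i) {
        auto it = matches.begin();
        std::advance(it, i);
        releaseAndDelete(*it);  // shrinks matches out from under the loop
    }
}

// Fixed pattern (per the fix description): snapshot the pointers into a temporary
// container first, then deactivate each claim from the snapshot.
void shutdownFixed() {
    std::vector<MatchRec*> snapshot(matches.begin(), matches.end());
    for (MatchRec* m : snapshot)
        releaseAndDelete(m);    // never iterates the container being modified
}

int main() {
    for (int i = 0; i < 4; ++i) matches.push_back(new MatchRec{i});
    shutdownBuggy();            // releases only claims 0 and 2 -- half are missed
    for (MatchRec* m : matches) delete m;
    matches.clear();

    for (int i = 0; i < 4; ++i) matches.push_back(new MatchRec{i});
    shutdownFixed();            // releases all four claims
    return 0;
}

In this toy version the buggy walk over four match records releases only claims 0 and 2, which mirrors the observed symptom of half the claims leaking; the snapshot variant releases all four.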


How reproducible:


Steps to Reproduce: To do: construct a reproducer, likely by detecting the presence of the 108 shadow exit status and/or the StartLog error messages.

Actual results:
Parallel jobs fail to release half of their claims.

Expected results:
All claims are properly released.

Additional info:
upstream https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2088

Comment 1 Erik Erlandson 2011-05-31 22:51:10 UTC
Fix is currently upstream on uw/V7_6-branch:
https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=22052

Comment 2 Erik Erlandson 2011-05-31 22:52:41 UTC
Repro/test:

using the following configuration:

# dedicated scheduler
DedicatedScheduler = "DedicatedScheduler"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
RANK = 0 + (Scheduler =?= $(DedicatedScheduler))
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
# slots
NUM_CPUS = 10


using the following submit file:

universe = parallel
cmd = /bin/sleep
args = 300
should_transfer_files = if_needed
when_to_transfer_output = on_exit
machine_count=4
+AccountingGroup = "none.user0000"
queue 1


Before fix: submit the job, and let it begin running. Then invoke a "condor_vacate_job".

You will see null-m_rec errors from the ded-schedd:

$ tail -f SchedLog | grep -e 'ERROR.*NULL m_rec'
05/26/11 21:54:02 (pid:11721) ERROR in releaseClaim(): NULL m_rec
05/26/11 21:54:02 (pid:11721) ERROR in releaseClaim(): NULL m_rec


After the fix, there will be no null m_rec errors.

Comment 3 Erik Erlandson 2011-05-31 22:57:48 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
The logic for deactivating claims in the dedicated scheduler iterated over a claim data structure while that structure was being destructively modified.

Consequence:
Some of the claims were not properly released.

Fix:
The logic was corrected to save the structure pointers in a temporary container while the claims are deactivated.

Result:
All claims are now properly deactivated.

Comment 5 Tomas Rusnak 2011-07-22 14:38:36 UTC
When I try to reproduce it, the SchedLog gives me the following:

07/22/11 17:27:18 (pid:24818) Trying to find 4 resource(s) for dedicated job 1.0
07/22/11 17:27:18 (pid:24818) Trying to satisfy job with all possible resources
07/22/11 17:27:18 (pid:24818) Can't satisfy job 1 with all possible resources... trying next job
07/22/11 17:27:18 (pid:24818) In DedicatedScheduler::publishRequestAd()

The job is still in the idle state. I also looked into the original reproducer on the Condor wiki, but it is the same.

Could you please check the reproduction method once again?

Comment 6 Erik Erlandson 2011-07-22 19:47:44 UTC
(In reply to comment #5)

> 07/22/11 17:27:18 (pid:24818) Can't satisfy job 1 with all possible
> resources... trying next job

Are there >= 4 slots defined in the pool?  If so, I'd be interested in the output of condor_q -better-analyze.

Comment 7 Tomas Rusnak 2011-07-25 08:39:59 UTC
As in the repro comment, I set NUM_CPUS = 10.

# condor_q -better-analyze
-- Submitter: localhost : <IP:45008> : localhost
---
002.000:  Request has not yet been considered by the matchmaker.

SchedLog:

07/25/11 11:22:13 (pid:24818) Starting DedicatedScheduler::handleDedicatedJobs
07/25/11 11:22:14 (pid:24818) Inserting new attribute Scheduler into non-active cluster cid=2 acid=-1
07/25/11 11:22:14 (pid:24818) Found 1 idle dedicated job(s)
07/25/11 11:22:14 (pid:24818) DedicatedScheduler: Listing all dedicated jobs -
07/25/11 11:22:14 (pid:24818) Dedicated job: 2.0 test
07/25/11 11:22:14 (pid:24818) Trying to query collector <IP:9618>
07/25/11 11:22:14 (pid:24818) Found 0 potential dedicated resources
07/25/11 11:22:14 (pid:24818) idle resource list
07/25/11 11:22:14 (pid:24818)  ************ empty ************
07/25/11 11:22:14 (pid:24818) limbo resource list
07/25/11 11:22:14 (pid:24818)  ************ empty ************
07/25/11 11:22:14 (pid:24818) unclaimed resource list
07/25/11 11:22:14 (pid:24818)  ************ empty ************
07/25/11 11:22:14 (pid:24818) busy resource list
07/25/11 11:22:14 (pid:24818)  ************ empty ************
07/25/11 11:22:14 (pid:24818) Trying to find 4 resource(s) for dedicated job 2.0
07/25/11 11:22:14 (pid:24818) Trying to satisfy job with all possible resources
07/25/11 11:22:14 (pid:24818) Can't satisfy job 2 with all possible resources... trying next job
07/25/11 11:22:14 (pid:24818) In DedicatedScheduler::publishRequestAd()

I can't find the reason why the schedd can't find resources for that job.

Comment 9 Tomas Rusnak 2011-07-25 16:47:46 UTC
Thanks Erik. Reproduced on:

$CondorVersion: 7.6.1 Jun 02 2011 BuildID: RH-7.6.1-0.10.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

tail -f /var/log/condor/SchedLog | grep -e 'ERROR.*NULL m_rec'
07/25/11 19:32:34 (pid:16953) ERROR in releaseClaim(): NULL m_rec
07/25/11 19:32:34 (pid:16953) ERROR in releaseClaim(): NULL m_rec

Comment 10 Tomas Rusnak 2011-07-26 10:30:32 UTC
Retested on all supported platforms (x86 and x86_64, RHEL5 and RHEL6) with:

condor-7.6.3-0.2

# condor_q
-- Submitter: localhost: <IP:41622> : localhost
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 365.0   test            7/26 11:33   0+00:00:25 R  0   0.0  sleep 300         

1 jobs; 0 idle, 1 running, 0 held
# condor_vacate_job 365.0
Job 365.0 vacated
# tail -f /var/log/condor/SchedLog | grep -e 'ERROR.*NULL m_rec'
^C

No NULL m_rec errors found in the scheduler log.

>>> VERIFIED

Comment 11 errata-xmlrpc 2011-09-07 16:41:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html

