Description of problem:
The schedd is configured with MAX_JOBS_RUNNING=200 and 500 concurrent DAG jobs are submitted. However, the condor_dagman jobs themselves count against the total jobs running. So the schedd happily accepts all 500 DAG submissions and only realizes it is over the limit well after those DAGs have submitted their own node jobs. As a result, the schedd gets stuck.

Version-Release number of selected component (if applicable):
$CondorVersion: 7.3.2 Jun 8 2009 BuildID: RH-7.3.2-0.2.el5 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-LINUX_RHEL5 $

How reproducible:
100%

Steps to Reproduce:
1. Log into ha-schedd (mrg27) as bigmonkey
2. cd dagman
3. ./dag_driver.sh
4. Wait for all 500 submits to complete
5. Run condor_q -dag until its summary stops updating the running/idle counts

Actual results:
Job status does not change:
[16:08:03][bigmonkey@mrg27:~/dagman]$ condor_q -dag | tail -1
678 jobs; 178 idle, 500 running, 0 held
[16:09:19][bigmonkey@mrg27:~/dagman]$ condor_q -dag | tail -1
678 jobs; 178 idle, 500 running, 0 held

Job output does not change:
[15:57:04][bigmonkey@mrg27:~/dagman]$ ls -1 /tmp/dag_test/out/* | wc -l
1350
[16:05:41][bigmonkey@mrg27:~/dagman]$ ls -1 /tmp/dag_test/out/* | wc -l
1350

Expected results:
Unclear. Perhaps the schedd should put new top-level dagmans on Hold until the queue is back under the limit? Or maybe introduce a configurable buffer that kicks in and holds new jobs as the queue approaches some percentage of the limit?

Additional info:
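The stuck state can be illustrated with a toy model (a sketch only, not Condor code; the slot accounting is a deliberate simplification of the behavior described above): once the running slots are consumed by dagman jobs, the node jobs they submit can never start, and the dagmans can never exit because they are waiting on those nodes.

```python
# Toy model of the reported deadlock (illustrative only; not Condor code).
# Assumption from the report: dagman jobs and node jobs draw from the
# same MAX_JOBS_RUNNING budget.

def simulate(max_running, num_dags):
    # The schedd accepts every dagman submission before enforcing the
    # limit, so dagmans can occupy every available running slot.
    running_dagmans = min(num_dags, max_running)  # simplification
    free_slots = max_running - running_dagmans
    # Node jobs submitted by the running dagmans have nowhere to run,
    # and a dagman cannot exit until its node jobs finish: deadlock.
    deadlocked = free_slots == 0 and running_dagmans > 0
    return running_dagmans, free_slots, deadlocked

dagmans, free, stuck = simulate(200, 500)
print(f"running dagmans={dagmans}, free slots={free}, deadlocked={stuck}")
```

With the values from this report (200 slots, 500 DAGs), every slot holds a dagman and no node job can ever run.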
Perhaps this is intended to be managed by some combination of:
-maxidle <number> (maximum number of idle nodes to allow)
-maxjobs <number> (maximum number of jobs ever submitted at once)
Will experiment...
Examining the submit code, -maxidle/-maxjobs provide no relief since they only count per dagman. Have to look at the schedd to figure out whether there is a way the submit client can get the "overall" picture at submission time.
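Since the flags are per-dagman, even a conservative -maxjobs setting still multiplies across DAGs and can exceed the schedd-wide limit. A quick arithmetic sketch (the per-DAG cap of 10 is an illustrative value, not from the report):

```python
# Illustrative arithmetic: -maxjobs caps each dagman individually, so the
# worst-case aggregate submission count still scales with the DAG count.

def aggregate_jobs(num_dags, maxjobs_per_dag):
    """Worst-case queued jobs if every dagman reaches its own -maxjobs cap."""
    return num_dags * maxjobs_per_dag

MAX_JOBS_RUNNING = 200            # schedd-wide limit from this report
total = aggregate_jobs(500, 10)   # 500 DAGs, hypothetical -maxjobs 10
print(total)                      # 5000
print(total > MAX_JOBS_RUNNING)   # True: per-dagman caps cannot protect the schedd
```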
New BZ created (526480); it will promote the workaround of capacity planning prior to large multi-DAG deployments and setting an appropriately high MAX_JOBS_RUNNING (estimate: number of concurrent DAGs X number of nodes in the largest DAG). The next level of analysis on this BZ will focus on why the schedd appears not to be doing bookkeeping of ALL jobs (dagman + nodes). Will also try to solicit input/opinions from UW.
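The sizing estimate from the workaround can be written out explicitly. A sketch follows; the extra per-DAG term is an assumption based on the earlier observation that the dagman jobs themselves count against MAX_JOBS_RUNNING, and the example numbers are hypothetical:

```python
def recommended_max_jobs_running(concurrent_dags, largest_dag_nodes):
    # Estimate from the proposed workaround:
    #   concurrent DAGs x nodes in the largest DAG,
    # plus one slot per dagman job itself, since dagmans also count
    # against MAX_JOBS_RUNNING (assumption based on this report).
    return concurrent_dags * largest_dag_nodes + concurrent_dags

# Hypothetical example: 500 concurrent DAGs, largest DAG has 3 nodes.
print(recommended_max_jobs_running(500, 3))  # 2000
```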
Referenced upstream at http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=642. The ultimate resolution is in that ticket. Write a KB article to be given to Lana for reference in the next User Guide. Consult Mike Cressman or John Thomas for KB article tips.
KB article http://kbase.redhat.com/faq/docs/DOC-33345 submitted to SME jthomas for tech review.