Bug 674659 - [MRG][RFE] Implement First Come, First Served for dagman jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.0
Target Release: ---
Assignee: Jon Thomas
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks: 693778
 
Reported: 2011-02-02 20:30 UTC by Jon Thomas
Modified: 2018-11-14 15:39 UTC (History)

Fixed In Version: condor-7.6.0-0.4
Doc Type: Enhancement
Doc Text:
Condor now includes the PreJobPrio1, PreJobPrio2, PostJobPrio1, PostJobPrio2 job ad attributes, which allow jobs to be ordered outside the previously present JobPrio attribute.
Clone Of:
Environment:
Last Closed: 2011-06-23 15:38:10 UTC
Target Upstream Version:
Embargoed:


Attachments
dagmanFCFS patch (5.93 KB, patch), 2011-03-15 13:01 UTC, Jon Thomas
test results (11.96 KB, text/plain), 2011-03-16 16:05 UTC, Jon Thomas
test for fcfs (40.00 KB, application/x-tar), 2011-03-16 16:18 UTC, Jon Thomas
dagman pre/post jobprio patch (9.37 KB, patch), 2011-03-25 18:33 UTC, Jon Thomas
pr/post prios patch (8.99 KB, patch), 2011-03-28 17:59 UTC, Jon Thomas
tests/results for pre/post prio dag patch (50.00 KB, text/plain), 2011-03-28 18:41 UTC, Jon Thomas
pre/post prio patch (9.65 KB, patch), 2011-03-28 21:05 UTC, Jon Thomas
dagman fcfs tests (50.00 KB, application/x-tar), 2011-03-28 21:07 UTC, Jon Thomas
next version of patch (9.45 KB, patch), 2011-03-29 12:43 UTC, Jon Thomas
dagman-v2 patch (9.64 KB, patch), 2011-03-29 14:39 UTC, Jon Thomas
dagman v3 patch (9.60 KB, patch), 2011-03-29 18:17 UTC, Jon Thomas


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2011:0889 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Grid 2.0 Release 2011-06-23 15:35:53 UTC

Description Jon Thomas 2011-02-02 20:30:11 UTC
[MRG][RFE] We need a way to implement First Come, First Served for dagman jobs where the submission time used for FCFS for spawned jobs is the submission time of the dagman job.

FCFS in Condor is based on the submission time of the child jobs submitted by a condor_dagman job and not the initial submission event (the submission time of the dagman job). What we would like to see, and believe is fairer, is for FCFS for dagman child jobs to be based on the submission time of the initial submission event, i.e. the QDate of the dagman job.

Comment 1 Matthew Farrellee 2011-02-02 21:28:56 UTC
Discussions around this settled on a job attribute, or set of attributes, that orders jobs before JobPrio does. Specifically, prio_compar() in schedd.cpp orders by JobPrio, then QDate, then ClusterID, then ProcID. The proposed attributes would bracket JobPrio in that ordering and apply only if present on both jobs being compared.
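
The comparison order described above can be sketched as follows. This is a minimal Python illustration, not the schedd.cpp code. It uses the attribute names from the final patch (PreJobPrio1, PreJobPrio2, PostJobPrio1, PostJobPrio2) and assumes that larger values run first (as with JobPrio), that the Pre levels are checked before JobPrio and the Post levels after it, and that a level is skipped unless both jobs define it. The job dictionaries and their values are hypothetical.

```python
# Assumed check order: Pre levels bracket JobPrio from above, Post levels from below.
PRIO_LEVELS = ("PreJobPrio1", "PreJobPrio2", "JobPrio", "PostJobPrio1", "PostJobPrio2")
# Existing tie-breakers: earlier submission first, then lower ids.
TIE_BREAKERS = ("QDate", "ClusterId", "ProcId")


def prio_compare(a, b):
    """Return -1 if job a should run before job b, 1 if after, 0 if equal."""
    for attr in PRIO_LEVELS:
        av, bv = a.get(attr), b.get(attr)
        if av is None or bv is None:
            continue  # a level applies only when both jobs define it
        if av != bv:
            return -1 if av > bv else 1  # assumption: larger value runs first
    for attr in TIE_BREAKERS:
        if a[attr] != b[attr]:
            return -1 if a[attr] < b[attr] else 1
    return 0


# Hypothetical jobs: the first DAG's child was queued later than the second DAG's
# child, but the submitter gave the first DAG a larger PreJobPrio1.
first_dag_child = {"PreJobPrio1": 2, "JobPrio": 0, "QDate": 400, "ClusterId": 4, "ProcId": 0}
second_dag_child = {"PreJobPrio1": 1, "JobPrio": 0, "QDate": 300, "ClusterId": 3, "ProcId": 0}
plain_job = {"JobPrio": 0, "QDate": 350, "ClusterId": 2, "ProcId": 0}

print(prio_compare(first_dag_child, second_dag_child))  # decided by PreJobPrio1
print(prio_compare(first_dag_child, plain_job))         # Pre level skipped, decided by QDate
```

Because a level is skipped when only one job defines it, comparisons across a mix of jobs with and without these attributes fall back to the older JobPrio/QDate behavior for that pair.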

The DAGMan submissions will then be able to specify +PreJobPrio=<value> or maybe via DAGMan VAR. The <value> would be determined by the submitter.

This is also related to:

   https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=804

Comment 2 Jon Thomas 2011-02-14 21:05:54 UTC
There is a case where the proposed solution doesn't prioritize one dag job over another.

Consider two dagman jobs.
 
#dagfile for the first
JOB  A  A.condor 
JOB  B  B.condor 
JOB  C  C.condor
PARENT A CHILD B 
PARENT B CHILD C

#dagfile for second
JOB  F  F.condor 
JOB  G  G.condor 
JOB  H  H.condor
PARENT F CHILD G 
PARENT G CHILD H

Submit the first followed by the second (order doesn't really matter).

A.condor will run and the queue will have F.condor in idle state. Condor_dagman won't submit B.condor until A.condor finishes. By then, F.condor will be in running state. Without preemption, B.condor will have to wait until F.condor completes. Assigned priorities won't matter because the higher prio dagman won't have a job in the queue.

What the competing dags look like will determine how much of an impact this is on actual results. The same scenario could happen with more complex dag jobs where the next stage of the dag relies upon the results of multiple jobs (the bottom of the diamond example). The higher prio dag won't have a job in the queue.
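
The scenario above can be reproduced with a toy model. This is a rough Python sketch under stated assumptions, not a model of the real negotiator: it assumes a single slot, no preemption, linear DAGs only, and that a freed slot is refilled from the jobs already in the queue before condor_dagman has submitted the child of the job that just finished. The function name run_order and the priority values are hypothetical.

```python
def run_order(dags):
    """dags: list of (priority, [node names in chain order]).

    Returns the order in which nodes start on a single slot.
    """
    # Idle jobs as (sort key, dag index, node index); smaller key runs first.
    idle = [(-prio, d, 0) for d, (prio, _) in enumerate(dags)]
    order = []
    late = None  # child node that DAGMan has not submitted yet
    while idle or late is not None:
        # The slot is refilled from whatever is already queued...
        started = min(idle) if idle else None
        # ...and only then does the child of the finished node arrive.
        if late is not None:
            idle.append(late)
            late = None
        if started is None:
            continue
        idle.remove(started)
        key, d, n = started
        order.append(dags[d][1][n])
        if n + 1 < len(dags[d][1]):
            late = (key, d, n + 1)
    return order


# The first DAG has the higher priority, yet its nodes interleave with the second DAG's.
print(run_order([(10, ["A", "B", "C"]), (0, ["F", "G", "H"])]))
```

In this model the higher-priority DAG never has a job waiting in the queue at the moment the slot frees up, so its priority is never compared against anything.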

Comment 4 Jon Thomas 2011-03-15 13:01:38 UTC
Created attachment 484460 [details]
dagmanFCFS patch

Comment 5 Jon Thomas 2011-03-16 16:05:07 UTC
Created attachment 485779 [details]
test results

Comment 6 Jon Thomas 2011-03-16 16:18:56 UTC
Created attachment 485783 [details]
test for fcfs

Untar and look at readme and results files.

Comment 7 Jon Thomas 2011-03-22 14:46:21 UTC
Upstream ticket: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1992

Comment 9 Jon Thomas 2011-03-25 18:33:09 UTC
Created attachment 487639 [details]
dagman pre/post jobprio patch

Two pre levels. Two post levels.

I'll need to change my tests and verify.

Comment 10 Jon Thomas 2011-03-28 17:59:48 UTC
Created attachment 488217 [details]
pr/post prios patch

Comment 11 Jon Thomas 2011-03-28 18:41:08 UTC
Created attachment 488224 [details]
tests/results for pre/post prio dag patch

Untar and read readme and results file

Comment 12 Jon Thomas 2011-03-28 21:05:43 UTC
Created attachment 488268 [details]
pre/post prio patch

Changed code so comparison between a job with assigned pre/post prios and a job with no assigned pre/post prios reverts to old behavior.

Comment 13 Jon Thomas 2011-03-28 21:07:02 UTC
Created attachment 488269 [details]
dagman fcfs tests

untar and read readme and results file

Comment 14 Jon Thomas 2011-03-29 12:43:25 UTC
Created attachment 488423 [details]
next version of patch

removed two lines of extraneous code

Comment 15 Jon Thomas 2011-03-29 14:39:24 UTC
Created attachment 488469 [details]
dagman-v2 patch

Changed logic to INT_MIN rather than -1. Changed conditionals for {} format.

Comment 16 Jon Thomas 2011-03-29 18:17:19 UTC
Created attachment 488523 [details]
dagman v3 patch

fixed int_min and comment

Comment 17 Matthew Farrellee 2011-03-29 19:52:41 UTC
Alternative method of testing...

# Using a personal condor configuration with NUM_CPUS=1
$ condor_master
$ condor_off -negotiator
$ ./submit.sh
...spam...
$ condor_on -negotiator
...wait about 5 minutes...
$ condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5  -

Visually inspect with:

$ condor_history -format "%d " JobStartDate -format "%d " PreJobPrio1 -format "%d " PreJobPrio2 -format "%d " JobPrio -format "%d " PostJobPrio1 -format "%d " PostJobPrio2 -format "%s\n" GlobalJobId | sort -n -k1


submit.sh:

#!/bin/sh

for pre1 in $(seq -1 1 1); do
   for pre2 in $(seq 1 -1 -1); do
      for prio in $(seq -1 1 1); do
         for post1 in $(seq 1 -1 -1); do
            for post2 in $(seq -1 1 1); do
               condor_submit -a pre1=$pre1 \
                             -a pre2=$pre2 \
                             -a prio=$prio \
                             -a post1=$post1 \
                             -a post2=$post2 \
                  job.sub
            done
         done
      done
   done
done


job.sub:

cmd = /bin/sleep
args = 1

log = job.log

+PreJobPrio1 = $(pre1)
+PreJobPrio2 = $(pre2)
priority = $(prio)
+PostJobPrio1 = $(post1)
+PostJobPrio2 = $(post2)

queue
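
For reference, the start order this test presumably expects can be enumerated directly. The Python sketch below is an illustration only: it assumes that larger values start first at every level and that levels are compared in the order PreJobPrio1, PreJobPrio2, JobPrio, PostJobPrio1, PostJobPrio2. It does not reproduce the condor_history output format and makes no claim about the md5sum value.

```python
from itertools import product

LEVELS = ("PreJobPrio1", "PreJobPrio2", "JobPrio", "PostJobPrio1", "PostJobPrio2")
VALUES = (-1, 0, 1)  # the values each loop in submit.sh iterates over

# submit.sh submits one job per combination: 3**5 = 243 jobs.
combos = list(product(VALUES, repeat=len(LEVELS)))

# Assumption: larger value first at each level, levels compared left to right,
# so the expected start order is the tuples in descending order.
expected = sorted(combos, reverse=True)

print(len(expected))
print(expected[0], expected[1], expected[-1])
```

Every combination is unique, so under these assumptions the order is fixed by the five priority levels alone and never reaches the QDate tie-breaker. The alternating loop directions in submit.sh mean the submission order differs from the expected start order, so a matching checksum shows the priorities were applied rather than plain submission order.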

Comment 18 Matthew Farrellee 2011-03-29 20:22:51 UTC
Pushed upstream for 7.7.0, available as UPSTREAM-7.7.0-BZ674659-FCFS

commit c5f031a105d2d40401053e1e50288e05d88446d2
Author: Jon Thomas <jthomas@redhat>
Date:   Tue Mar 29 16:11:33 2011 -0400

    Added {Pre,Post}JobPrio{1,2} job ad attributes, #1992
...

Comment 19 Matthew Farrellee 2011-04-27 20:34:50 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Added PreJobPrio1, PreJobPrio2, PostJobPrio1, PostJobPrio2 job ad attributes. They allow for ordering of jobs outside of the JobPrio attribute.

Comment 21 Tomas Rusnak 2011-05-11 11:18:32 UTC
Retested on all supported platforms (x86 and x86_64 on RHEL5 and RHEL6) with the test case from Comment #17 and the current packages:

condor-7.6.1-0.4

# condor -v
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

# condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5  -

# condor -v
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

# condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5  -

# condor -v
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $

# condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5  

# condor -v
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $

# condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5 

The negotiator followed the priorities correctly on all platforms.

>>> VERIFIED

Comment 22 Misha H. Ali 2011-05-30 06:43:36 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Added PreJobPrio1, PreJobPrio2, PostJobPrio1, PostJobPrio2 job ad attributes. They allow for ordering of jobs outside of the JobPrio attribute.
+Condor now includes the PreJobPrio1, PreJobPrio2, PostJobPrio1, PostJobPrio2 job ad attributes, which allow jobs to be ordered outside the previously present JobPrio attribute.

Comment 23 Misha H. Ali 2011-06-06 03:29:56 UTC
Technical note can be viewed in the release notes for 2.0 at the documentation stage here:

http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2.0/html-single/MRG_Release_Notes/index.html#tabl-MRG_Release_Notes-GRID_Update_Notes-RHM_Known_Issues

Comment 24 errata-xmlrpc 2011-06-23 15:38:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html

