[MRG][RFE] We need a way to implement First Come, First Served for dagman jobs where the submission time used for FCFS for spawned jobs is the submission time of the dagman job. FCFS in Condor is based on the submission time of the child jobs submitted by a condor_dagman job and not the initial submission event (the submission time of the dagman job). What we would like to see, and believe is fairer, is for FCFS for dagman child jobs to be based on the submission time of the initial submission event, i.e. the QDate of the dagman job.
Discussions around this settled on a job attribute, or set of attributes, that orders jobs before JobPrio does. Specifically, prio_compar() in schedd.cpp orders by JobPrio, then QDate, then ClusterID, then ProcID. The proposed solution would bracket JobPrio with additional attributes and only apply them if they are present on both jobs being compared. DAGMan submissions will then be able to specify +PreJobPrio=<value>, or maybe set it via a DAGMan VAR. The <value> would be determined by the submitter. This is also related to: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=804
There is a case where the proposed solution doesn't prioritize one dag job over another. The case is given two dagman jobs.

#dagfile for the first
JOB A A.condor
JOB B B.condor
JOB C C.condor
PARENT A CHILD B
PARENT B CHILD C

#dagfile for second
JOB F F.condor
JOB G G.condor
JOB H H.condor
PARENT F CHILD G
PARENT G CHILD H

Submit the first followed by the second (order doesn't really matter). A.condor will run and the queue will have F.condor in idle state. Condor_dagman won't submit B.condor until A.condor finishes. By then, F.condor will be in running state. Without preemption, B.condor will have to wait until F.condor completes. Assigned priorities won't matter because the higher prio dagman won't have a job in the queue.

What the competing dags look like will determine how much of an impact this is on actual results. The same scenario could happen with more complex dag jobs where the next stage of the dag relies upon the results of multiple jobs (the bottom of the diamond example). The higher prio dag won't have a job in the queue.
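The scenario above can be reproduced with a toy model. This is only a sketch under stated assumptions: a single execute slot, no preemption, and a child job that reaches the queue only after the freed slot has already been handed to whatever was idle at that moment. The function name `simulate` and the priority values are hypothetical and are not part of Condor.

```python
def simulate(dags, prio):
    """Toy single-slot model; returns the order in which nodes start running.

    dags: DAG name -> list of node names forming a linear chain
    prio: DAG name -> priority (assumption: higher value is preferred)
    """
    # The first node of every DAG is idle in the queue at the start.
    queue = [(name, chain[0]) for name, chain in dags.items()]
    next_idx = {name: 1 for name in dags}
    pending = None  # child whose submission lags behind its parent's completion
    order = []
    while queue or pending:
        started = None
        if queue:
            # The free slot goes to the best job that is in the queue right now.
            queue.sort(key=lambda item: -prio[item[0]])
            started, node = queue.pop(0)
            order.append(node)
        if pending:
            # The previous job's child arrives only after the slot was reassigned.
            queue.append(pending)
            pending = None
        if started is not None and next_idx[started] < len(dags[started]):
            pending = (started, dags[started][next_idx[started]])
            next_idx[started] += 1
    return order


dags = {"first": ["A", "B", "C"], "second": ["F", "G", "H"]}
prio = {"first": 10, "second": 0}  # hypothetical per-DAG priorities
print(simulate(dags, prio))  # ['A', 'F', 'B', 'G', 'C', 'H']
```

In this model the two DAGs alternate even though the first one has the higher priority, because its next node is never in the queue at the moment the slot is matched.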
Created attachment 484460 [details] dagmanFCFS patch
Created attachment 485779 [details] test results
Created attachment 485783 [details] test for fcfs Untar and look at readme and results files.
upstream https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1992
Created attachment 487639 [details] dagman pre/post jobprio patch Two pre levels. Two post levels. I'll need to change my tests and verify.
Created attachment 488217 [details] pr/post prios patch
Created attachment 488224 [details] tests/results for pre/post prio dag patch Untar and read readme and results file
Created attachment 488268 [details] pre/post prio patch Changed code so comparison between a job with assigned pre/post prios and a job with no assigned pre/post prios reverts to old behavior.
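For readers following the patch series, the comparison described in this thread might be sketched roughly as below. This is not the code from the patch: the level order (pre prios, then JobPrio, then post prios), the "higher value first" rule, skipping a level independently when either job lacks it, and treating a missing JobPrio as 0 are all assumptions made for illustration. Only the attribute names and the JobPrio, QDate, ClusterID, ProcID fallback order come from the report.

```python
from functools import cmp_to_key

# Attribute names are taken from this report; the level order and the
# "higher value runs first" rule are assumptions of this sketch.
LEVELS = ("PreJobPrio1", "PreJobPrio2", "JobPrio", "PostJobPrio1", "PostJobPrio2")
TIE_BREAKERS = ("QDate", "ClusterId", "ProcId")


def prio_compare(job_a, job_b):
    """Return a negative number if job_a should be considered before job_b."""
    for name in LEVELS:
        a = job_a.get(name)
        b = job_b.get(name)
        if name == "JobPrio":
            # Assumption: a job without an explicit priority has priority 0.
            a = 0 if a is None else a
            b = 0 if b is None else b
        elif a is None or b is None:
            # Level not set on both jobs: skip it, i.e. fall back to old behavior.
            continue
        if a != b:
            return -1 if a > b else 1
    # Old behavior after JobPrio: earlier QDate, then lower cluster and proc ids.
    for name in TIE_BREAKERS:
        a, b = job_a[name], job_b[name]
        if a != b:
            return -1 if a < b else 1
    return 0


dag1_child = {"PreJobPrio1": 5, "JobPrio": 0, "QDate": 200, "ClusterId": 12, "ProcId": 0}
dag2_child = {"PreJobPrio1": 1, "JobPrio": 10, "QDate": 100, "ClusterId": 11, "ProcId": 0}
ordered = sorted([dag2_child, dag1_child], key=cmp_to_key(prio_compare))
print([job["ClusterId"] for job in ordered])  # [12, 11]
```

Here the job with the larger PreJobPrio1 is ordered first even though its JobPrio is lower and its QDate is later. When one of the two jobs has no pre/post prios at all, every pre/post level is skipped and the result is the plain JobPrio, QDate, ClusterID, ProcID ordering.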
Created attachment 488269 [details] dagman fcfs tests untar and read readme and results file
Created attachment 488423 [details] next version of patch removed two lines of extraneous code
Created attachment 488469 [details] dagman-v2 patch Changed logic to INT_MIN rather than -1. Changed conditionals for {} format.
Created attachment 488523 [details] dagman v3 patch fixed int_min and comment
Alternative method of testing...

# Using a personal condor configuration with NUM_CPUS=1
$ condor_master
$ condor_off -negotiator
$ ./submit.sh
...spam...
$ condor_on -negotiator
...wait about 5 minutes...
$ condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5 -

Visually inspect with:

$ condor_history -format "%d " JobStartDate -format "%d " PreJobPrio1 -format "%d " PreJobPrio2 -format "%d " JobPrio -format "%d " PostJobPrio1 -format "%d " PostJobPrio2 -format "%s\n" GlobalJobId | sort -n -k1

submit.sh:

#!/bin/sh
for pre1 in $(seq -1 1 1); do
  for pre2 in $(seq 1 -1 -1); do
    for prio in $(seq -1 1 1); do
      for post1 in $(seq 1 -1 -1); do
        for post2 in $(seq -1 1 1); do
          condor_submit -a pre1=$pre1 \
                        -a pre2=$pre2 \
                        -a prio=$prio \
                        -a post1=$post1 \
                        -a post2=$post2 \
                        job.sub
        done
      done
    done
  done
done

job.sub:

cmd = /bin/sleep
args = 1
log = job.log
+PreJobPrio1 = $(pre1)
+PreJobPrio2 = $(pre2)
priority = $(prio)
+PostJobPrio1 = $(post1)
+PostJobPrio2 = $(post2)
queue
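As a cross-check for the visual inspection, the start order one would expect from the 243 jobs submitted by submit.sh can be enumerated offline. This is a sketch that assumes higher values win at every level and that the levels are compared in the order PreJobPrio1, PreJobPrio2, JobPrio, PostJobPrio1, PostJobPrio2. It makes no claim about reproducing the md5sum above, since that depends on the exact condor_history output.

```python
from itertools import product

# Column order matches the condor_history command after `cut -f1 --complement`.
NAMES = ("PreJobPrio1", "PreJobPrio2", "JobPrio", "PostJobPrio1", "PostJobPrio2")
VALUES = (-1, 0, 1)  # the values produced by the seq calls in submit.sh

# One job per combination, as submitted by the five nested loops.
jobs = [dict(zip(NAMES, combo)) for combo in product(VALUES, repeat=len(NAMES))]

# Assumption: at every level the higher value is scheduled first.
expected = sorted(jobs, key=lambda job: tuple(-job[name] for name in NAMES))

for job in expected:
    print("\t".join(str(job[name]) for name in NAMES))
```

Because every combination of the five values is unique, QDate and the cluster id never have to break a tie in this test, so the submission order chosen by the nested loops should not affect the expected result.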
Pushed upstream for 7.7.0, available as UPSTREAM-7.7.0-BZ674659-FCFS

commit c5f031a105d2d40401053e1e50288e05d88446d2
Author: Jon Thomas <jthomas@redhat>
Date: Tue Mar 29 16:11:33 2011 -0400

    Added {Pre,Post}JobPrio{1,2} job ad attributes, #1992
    ...
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Added PreJobPrio1, PreJobPrio2, PostJobPrio1, PostJobPrio2 job ad attributes. They allow for ordering of jobs outside of the JobPrio attribute.
Retested on all supported platforms x86,x86_64/RHEL5,RHEL6 with test case from Comment #17 and with actual packages: condor-7.6.1-0.4

# condor -v
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $
# condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5 -

# condor -v
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $
# condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5 -

# condor -v
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: I686-RedHat_6.0 $
# condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5

# condor -v
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el5 $
$CondorPlatform: I686-RedHat_5.6 $
# condor_history -format "%d\t" JobStartDate -format "%d\t" PreJobPrio1 -format "%d\t" PreJobPrio2 -format "%d\t" JobPrio -format "%d\t" PostJobPrio1 -format "%d\n" PostJobPrio2 | sort -n -k1 | cut -f1 --complement | md5sum -
2365d29aa83308493a0387e6038e9cd5

The priorities were followed by negotiator on all platforms correctly.

>>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Added PreJobPrio1, PreJobPrio2, PostJobPrio1, PostJobPrio2 job ad attributes. They allow for ordering of jobs outside of the JobPrio attribute.
+Condor now includes the PreJobPrio1, PreJobPrio2, PostJobPrio1, PostJobPrio2 job ad attributes, which allow jobs to be ordered outside the previously present JobPrio attribute.
Technical note can be viewed in the release notes for 2.0 at the documentation stage here: http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2.0/html-single/MRG_Release_Notes/index.html#tabl-MRG_Release_Notes-GRID_Update_Notes-RHM_Known_Issues
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2011-0889.html