Description of problem:
During testing of 584562 I ran 2 DAGMan jobs, each from a separate schedd. I configured dynamic slots so that both jobs could run at the same time, but only one was in the running state. The second one stayed idle while the first job was running.

Version-Release number of selected component (if applicable):
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

How reproducible:
100%

Steps to Reproduce:
1. set up condor with dynamic slots and multiple schedds
2. run 2 dagman jobs, each from a separate schedd
3. check condor_status and condor_q -global

Actual results:
Only one dagman job is running.

Expected results:
All jobs are running if there are enough free slots.

Additional info:

-config:
NUM_CPUS=8
SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

SCHEDD0 = $(SCHEDD)
SCHEDD0_ARGS = -f -local-name schedd0
SCHEDD.SCHEDD0.SCHEDD_NAME = schedd0
SCHEDD.SCHEDD0.SCHEDD_ADDRESS_FILE = /tmp/schedd0/schedd_address_file
SCHEDD.SCHEDD0.SCHEDD_DAEMON_AD_FILE = /tmp/schedd0/schedd_classad
SCHEDD.SCHEDD0.SPOOL = /tmp/schedd0

SCHEDD1 = $(SCHEDD)
SCHEDD1_ARGS = -f -local-name schedd1
SCHEDD.SCHEDD1.SCHEDD_NAME = schedd1
SCHEDD.SCHEDD1.SCHEDD_ADDRESS_FILE = /tmp/schedd1/schedd_address_file
SCHEDD.SCHEDD1.SCHEDD_DAEMON_AD_FILE = /tmp/schedd1/schedd_classad
SCHEDD.SCHEDD1.SPOOL = /tmp/schedd1

DAEMON_LIST = MASTER, NEGOTIATOR, COLLECTOR, SCHEDD0, SCHEDD1, STARTD

-submit:
# cat diamond.dag
# this file is called diamond.dag
JOB A A.submit
JOB B B.submit
JOB C C.submit
JOB D D.submit
PARENT A CHILD B C
PARENT B C CHILD D

# cat A.submit      <<< every job in the DAG uses the same submit file as A
Universe = vanilla
cmd = /bin/sleep
args= 100
output = A.out
log = diamond.log
Queue

# sudo -u test condor_submit_dag -schedd-daemon-ad-file /tmp/schedd0/schedd_classad diamond.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : diamond.dag.condor.sub
Log of DAGMan debugging messages                 : diamond.dag.dagman.out
Log of Condor library output                     : diamond.dag.lib.out
Log of Condor library error messages             : diamond.dag.lib.err
Log of the life of condor_dagman itself          : diamond.dag.dagman.log

Submitting job(s).
1 job(s) submitted to cluster 7.

# sudo -u test condor_submit_dag -schedd-daemon-ad-file /tmp/schedd1/schedd_classad diamond.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : diamond.dag.condor.sub
Log of DAGMan debugging messages                 : diamond.dag.dagman.out
Log of Condor library output                     : diamond.dag.lib.out
Log of Condor library error messages             : diamond.dag.lib.err
Log of the life of condor_dagman itself          : diamond.dag.dagman.log

Submitting job(s).
1 job(s) submitted to cluster 13.
-----------------------------------------------------------------------

-results:
# condor_q -global

-- Schedd: schedd0@hostname : <IP:55581>
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  7.0    test    5/10 11:15  0+00:06:21 R  0   2.0  condor_dagman
 11.0    test    5/10 11:21  0+00:00:00 I  0   0.0  sleep 100

2 jobs; 1 idle, 1 running, 0 held

-- Schedd: schedd1@hostname : <IP:46922>
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 13.0    test    5/10 11:17  0+00:04:26 R  0   2.0  condor_dagman
 15.0    test    5/10 11:19  0+00:01:10 R  0   0.0  sleep 100
 16.0    test    5/10 11:19  0+00:00:21 R  0   0.0  sleep 100

3 jobs; 0 idle, 3 running, 0 held

# condor_status
Name              OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime

slot1@hostname    LINUX  X86_64  Unclaimed  Idle      0.380   487  0+00:00:04
slot1_1@hostname  LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:03
slot1_2@hostname  LINUX  X86_64  Claimed    Busy      0.000     1  0+00:00:53

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting

X86_64/LINUX         3      0        2          1        0           0

       Total         3      0        2          1        0           0

The expected result would be 3 busy slots: 1 for the first job from schedd0 and 2 for the second job from schedd1.
Retested over multiple versions of condor. No such problem found. Rescheduling is always slower on a slower machine, which makes the DAG look as if it is not running in parallel.

>>> NOTABUG
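
To tell a transient scheduling delay from a real failure, it can help to tally job states per schedd from several `condor_q -global` snapshots taken a few minutes apart. The following is only a minimal sketch in Python: it assumes the text layout shown in the description (a `-- Schedd:` header followed by rows with the state in the sixth column), and the function name `tally_states` is hypothetical, not part of condor.

```python
import re
from collections import Counter


def tally_states(condor_q_text):
    """Count job states (R, I, H, ...) per schedd in `condor_q -global` text.

    Assumes the layout from the report: ID, OWNER, submit date, submit time,
    RUN_TIME, then the one-letter state column.
    """
    tallies = {}
    current = None
    for line in condor_q_text.splitlines():
        header = re.match(r"\s*-- Schedd:\s*(\S+)", line)
        if header:
            current = header.group(1)
            tallies[current] = Counter()
            continue
        # Job rows start with a cluster.proc ID such as "7.0".
        job = re.match(r"\s*\d+\.\d+\s+\S+\s+\S+\s+\S+\s+\S+\s+([A-Z])\s", line)
        if job and current is not None:
            tallies[current][job.group(1)] += 1
    return tallies


# Snapshot copied from the report above.
SAMPLE = """\
-- Schedd: schedd0@hostname : <IP:55581>
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  7.0    test    5/10 11:15  0+00:06:21 R  0   2.0  condor_dagman
 11.0    test    5/10 11:21  0+00:00:00 I  0   0.0  sleep 100

2 jobs; 1 idle, 1 running, 0 held

-- Schedd: schedd1@hostname : <IP:46922>
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 13.0    test    5/10 11:17  0+00:04:26 R  0   2.0  condor_dagman
 15.0    test    5/10 11:19  0+00:01:10 R  0   0.0  sleep 100
 16.0    test    5/10 11:19  0+00:00:21 R  0   0.0  sleep 100

3 jobs; 0 idle, 3 running, 0 held
"""

if __name__ == "__main__":
    for schedd, counts in tally_states(SAMPLE).items():
        print(schedd, dict(counts))
```

If a job that is idle in one snapshot is running in the next, the delay came from the negotiation cycle rather than from a scheduling failure.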