Bug 703401

Summary: DAGMan jobs with dynamic slots run only one DAGMan job at a time
Product: Red Hat Enterprise MRG
Reporter: Tomas Rusnak <trusnak>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED NOTABUG
QA Contact: Tomas Rusnak <trusnak>
Severity: high
Priority: high
Version: 2.0
CC: matt, tstclair
Target Milestone: 2.0.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: condor-7.6.1-0.4
Doc Type: Bug Fix
Last Closed: 2012-01-06 08:05:23 UTC

Description Tomas Rusnak 2011-05-10 09:33:34 UTC
Description of problem:
During bug 584562 testing I ran 2 DAGMan jobs, each from a separate schedd. After configuring dynamic slots so that both jobs could run, only one was in the running state. The second stayed idle while the first ran.

Version-Release number of selected component (if applicable):
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

How reproducible:
100%

Steps to Reproduce:
1. set up Condor with dynamic slots and multiple schedds
2. run 2 DAGMan jobs, each from a separate schedd
3. check condor_status and condor_q -global
  
Actual results:
Only one dagman job is running

Expected results:
All jobs are running if there are enough free slots.

Additional info:

-config:

NUM_CPUS=8
SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

SCHEDD0 = $(SCHEDD)
SCHEDD0_ARGS = -f -local-name schedd0
SCHEDD.SCHEDD0.SCHEDD_NAME = schedd0
SCHEDD.SCHEDD0.SCHEDD_ADDRESS_FILE = /tmp/schedd0/schedd_address_file
SCHEDD.SCHEDD0.SCHEDD_DAEMON_AD_FILE = /tmp/schedd0/schedd_classad
SCHEDD.SCHEDD0.SPOOL = /tmp/schedd0
SCHEDD1 = $(SCHEDD)
SCHEDD1_ARGS = -f -local-name schedd1
SCHEDD.SCHEDD1.SCHEDD_NAME = schedd1
SCHEDD.SCHEDD1.SCHEDD_ADDRESS_FILE = /tmp/schedd1/schedd_address_file
SCHEDD.SCHEDD1.SCHEDD_DAEMON_AD_FILE = /tmp/schedd1/schedd_classad
SCHEDD.SCHEDD1.SPOOL = /tmp/schedd1
DAEMON_LIST = MASTER, NEGOTIATOR,  COLLECTOR, SCHEDD0, SCHEDD1, STARTD

-submit:

# cat diamond.dag
# this file is called diamond.dag 
JOB A A.submit 
JOB B B.submit 
JOB C C.submit 
JOB D D.submit 
PARENT A CHILD B C 
PARENT B C CHILD D
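To illustrate the parallelism the diamond DAG above is meant to exhibit, here is a minimal sketch (plain Python, not Condor code) that computes which node jobs are runnable once their parents have finished. The `parents` map is a hand transcription of the two PARENT/CHILD lines; it shows that B and C should run concurrently after A completes.

```python
# Diamond DAG from diamond.dag: A -> (B, C) -> D
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

def ready(done):
    """Jobs not yet done whose parents have all finished."""
    return sorted(j for j, ps in parents.items()
                  if j not in done and all(p in done for p in ps))

print(ready(set()))            # only A can start
print(ready({"A"}))            # B and C are runnable in parallel
print(ready({"A", "B", "C"}))  # D runs last
```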

# cat A.submit    <<< each DAG node's submit file is the same as A.submit
Universe   = vanilla
cmd = /bin/sleep
args= 100
output     = A.out
log        = diamond.log
Queue
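Not part of the original report: with partitionable slots, each dynamic slot is carved to the size of the job's resource requests, and the submit file above relies on the defaults. A hypothetical variant making the request explicit (`request_cpus` is a standard condor_submit command; one core per sleep job is an assumption, not stated in the report):

```
# Illustrative variant of A.submit with an explicit resource request
Universe     = vanilla
cmd          = /bin/sleep
args         = 100
request_cpus = 1
output       = A.out
log          = diamond.log
Queue
```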

# sudo -u test condor_submit_dag -schedd-daemon-ad-file /tmp/schedd0/schedd_classad diamond.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor           : diamond.dag.condor.sub
Log of DAGMan debugging messages                 : diamond.dag.dagman.out
Log of Condor library output                     : diamond.dag.lib.out
Log of Condor library error messages             : diamond.dag.lib.err
Log of the life of condor_dagman itself          : diamond.dag.dagman.log

Submitting job(s).
1 job(s) submitted to cluster 7.

# sudo -u test condor_submit_dag -schedd-daemon-ad-file /tmp/schedd1/schedd_classad diamond.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor           : diamond.dag.condor.sub
Log of DAGMan debugging messages                 : diamond.dag.dagman.out
Log of Condor library output                     : diamond.dag.lib.out
Log of Condor library error messages             : diamond.dag.lib.err
Log of the life of condor_dagman itself          : diamond.dag.dagman.log

Submitting job(s).
1 job(s) submitted to cluster 13.
-----------------------------------------------------------------------

-results:

# condor_q -global 
-- Schedd: schedd0@hostname : <IP:55581>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   7.0   test            5/10 11:15   0+00:06:21 R  0   2.0  condor_dagman     
  11.0   test            5/10 11:21   0+00:00:00 I  0   0.0  sleep 100         
2 jobs; 1 idle, 1 running, 0 held

-- Schedd: schedd1@hostname : <IP:46922>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  13.0   test            5/10 11:17   0+00:04:26 R  0   2.0  condor_dagman     
  15.0   test            5/10 11:19   0+00:01:10 R  0   0.0  sleep 100         
  16.0   test            5/10 11:19   0+00:00:21 R  0   0.0  sleep 100         

3 jobs; 0 idle, 3 running, 0 held

# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@hostname   LINUX      X86_64 Unclaimed Idle     0.380   487  0+00:00:04
slot1_1@hostname LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:03
slot1_2@hostname LINUX      X86_64 Claimed   Busy     0.000     1  0+00:00:53
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        3     0       2         1       0          0
               Total        3     0       2         1       0          0

The expected result would be 3 busy dynamic slots: 1 for the node job from schedd0 and 2 for the node jobs from schedd1.

Comment 4 Tomas Rusnak 2012-01-06 08:05:23 UTC
Retested over multiple versions of Condor. No such problem was found. Rescheduling is simply slower on a slow machine, which makes the DAGs appear to run serially rather than in parallel.

>>> NOTABUG
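The slow-machine effect described in comment 4 can sometimes be made less visible by shortening the daemons' cycle times. A hedged configuration sketch: NEGOTIATOR_INTERVAL and SCHEDD_INTERVAL are real HTCondor config macros, but the 20-second values below are illustrative and do not come from this report.

```
# Illustrative only: shorter negotiation/schedd cycles let newly ready
# DAG node jobs get matched sooner on slow test machines.
NEGOTIATOR_INTERVAL = 20
SCHEDD_INTERVAL     = 20
```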