| Summary: | DAGMan with dynamic slots runs only one DAGMan job at a time | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Tomas Rusnak <trusnak> |
| Component: | condor | Assignee: | Matthew Farrellee <matt> |
| Status: | CLOSED NOTABUG | QA Contact: | Tomas Rusnak <trusnak> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.0 | CC: | matt, tstclair |
| Target Milestone: | 2.0.1 | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | condor-7.6.1-0.4 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2012-01-06 08:05:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Retested over multiple versions of Condor; no such problem was found. Rescheduling simply takes longer on a slower machine, which can make the DAGs appear to run serially even though they are handled in parallel.
>>> NOTABUG
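If this needs to be re-verified on a slow machine, one way to rule out negotiation latency is to shorten the daemon cycles before repeating the test. A minimal sketch, assuming the standard Condor knobs NEGOTIATOR_INTERVAL, NEGOTIATOR_CYCLE_DELAY and SCHEDD_INTERVAL (the values below are illustrative, not recommendations):

# Illustrative values only: shorten the negotiation and schedd cycles so that
# node jobs from both DAGs are matched within a few seconds of each other.
NEGOTIATOR_INTERVAL = 20
NEGOTIATOR_CYCLE_DELAY = 5
SCHEDD_INTERVAL = 30

With shorter cycles, node jobs from both DAGs should reach the running state within a cycle or two of each other, which makes the parallelism visible even on slower hardware.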
Description of problem:
During testing of bug 584562 I ran two DAGMan jobs, each from a separate schedd. With dynamic slots configured so that both jobs could run, only one was in the running state; the second stayed idle while the first was running.

Version-Release number of selected component (if applicable):
$CondorVersion: 7.6.1 Apr 27 2011 BuildID: RH-7.6.1-0.4.el6 $
$CondorPlatform: X86_64-RedHat_6.0 $

How reproducible:
100%

Steps to Reproduce:
1. Set up Condor with dynamic slots and multiple schedds.
2. Run two DAGMan jobs, each from a separate schedd.
3. Check condor_status and condor_q -global.

Actual results:
Only one DAGMan job is running.

Expected results:
All jobs are running if there are enough free slots.

Additional info:

-config:
NUM_CPUS = 8
SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1
SCHEDD0 = $(SCHEDD)
SCHEDD0_ARGS = -f -local-name schedd0
SCHEDD.SCHEDD0.SCHEDD_NAME = schedd0
SCHEDD.SCHEDD0.SCHEDD_ADDRESS_FILE = /tmp/schedd0/schedd_address_file
SCHEDD.SCHEDD0.SCHEDD_DAEMON_AD_FILE = /tmp/schedd0/schedd_classad
SCHEDD.SCHEDD0.SPOOL = /tmp/schedd0
SCHEDD1 = $(SCHEDD)
SCHEDD1_ARGS = -f -local-name schedd1
SCHEDD.SCHEDD1.SCHEDD_NAME = schedd1
SCHEDD.SCHEDD1.SCHEDD_ADDRESS_FILE = /tmp/schedd1/schedd_address_file
SCHEDD.SCHEDD1.SCHEDD_DAEMON_AD_FILE = /tmp/schedd1/schedd_classad
SCHEDD.SCHEDD1.SPOOL = /tmp/schedd1
DAEMON_LIST = MASTER, NEGOTIATOR, COLLECTOR, SCHEDD0, SCHEDD1, STARTD

-submit:
# cat diamond.dag
# this file is called diamond.dag
JOB A A.submit
JOB B B.submit
JOB C C.submit
JOB D D.submit
PARENT A CHILD B C
PARENT B C CHILD D

# cat A.submit   <<< each DAG node job is the same as A
Universe = vanilla
cmd = /bin/sleep
args = 100
output = A.out
log = diamond.log
Queue

# sudo -u test condor_submit_dag -schedd-daemon-ad-file /tmp/schedd0/schedd_classad diamond.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor : diamond.dag.condor.sub
Log of DAGMan debugging messages : diamond.dag.dagman.out
Log of Condor library output : diamond.dag.lib.out
Log of Condor library error messages : diamond.dag.lib.err
Log of the life of condor_dagman itself : diamond.dag.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 7.

# sudo -u test condor_submit_dag -schedd-daemon-ad-file /tmp/schedd1/schedd_classad diamond.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor : diamond.dag.condor.sub
Log of DAGMan debugging messages : diamond.dag.dagman.out
Log of Condor library output : diamond.dag.lib.out
Log of Condor library error messages : diamond.dag.lib.err
Log of the life of condor_dagman itself : diamond.dag.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 13.
-----------------------------------------------------------------------

-results:
# condor_q -global

-- Schedd: schedd0@hostname : <IP:55581>
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
  7.0    test    5/10 11:15   0+00:06:21 R  0   2.0  condor_dagman
 11.0    test    5/10 11:21   0+00:00:00 I  0   0.0  sleep 100

2 jobs; 1 idle, 1 running, 0 held

-- Schedd: schedd1@hostname : <IP:46922>
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 13.0    test    5/10 11:17   0+00:04:26 R  0   2.0  condor_dagman
 15.0    test    5/10 11:19   0+00:01:10 R  0   0.0  sleep 100
 16.0    test    5/10 11:19   0+00:00:21 R  0   0.0  sleep 100

3 jobs; 0 idle, 3 running, 0 held

# condor_status

Name             OpSys  Arch   State     Activity LoadAv Mem ActvtyTime
slot1@hostname   LINUX  X86_64 Unclaimed Idle     0.380  487 0+00:00:04
slot1_1@hostname LINUX  X86_64 Claimed   Busy     0.000    1 0+00:00:03
slot1_2@hostname LINUX  X86_64 Claimed   Busy     0.000    1 0+00:00:53

              Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX         3     0       2         1       0          0
       Total         3     0       2         1       0          0

The expected result would be 3 slots busy: 1 for the first DAG's job from schedd0 and 2 for the second DAG's jobs from schedd1.
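For re-testing, one convenient way to watch the partitionable slot being carved into dynamic slots while both DAGs progress is to poll the queue and the slot list together. A minimal sketch using standard tools (the 5-second refresh interval is arbitrary):

# watch -n 5 'condor_q -global; echo; condor_status'

Seeing additional slot1_N entries appear as Claimed/Busy while node jobs from both schedds move to the running state confirms that the two DAGs are in fact running in parallel.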