Bug 809732 - PU job is run before slots are cleaned from previous (removed) job
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
2.1.1
All Linux
Priority: high  Severity: low
: 2.3
: ---
Assigned To: Timothy St. Clair
Daniel Horák
: Reopened
Depends On:
Blocks:
Reported: 2012-04-04 04:42 EDT by Daniel Horák
Modified: 2013-03-06 13:43 EST (History)
4 users

See Also:
Fixed In Version: condor-7.8.7-0.5
Doc Type: Bug Fix
Doc Text:
Cause: A parallel universe job whose claim can be reused is removed. Consequence: The job status can be invalid when a parallel universe job is removed and its claim is reused. Fix: Job status reporting was updated. Result: The job status is accurate.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-06 13:43:34 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
condor logs (311.88 KB, application/x-gzip)
2012-09-07 15:18 EDT, Timothy St. Clair
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Condor 3223 None None None 2012-09-07 17:14:59 EDT
Red Hat Product Errata RHSA-2013:0564 normal SHIPPED_LIVE Low: Red Hat Enterprise MRG Grid 2.3 security update 2013-03-06 18:37:09 EST

Description Daniel Horák 2012-04-04 04:42:30 EDT
Description of problem:
  Immediately after a running PU job is manually removed, another job is marked as running, but the slots are still occupied by the previous job, so after about 20 seconds the new job is returned to the idle state. After a while the job runs again, this time correctly.


Version-Release number of selected component (if applicable):
$ rpm -qa | grep condor
  condor-classads-7.6.5-0.12.el5
  condor-7.6.5-0.12.el5


How reproducible:
  100%


Steps to Reproduce:
1. Configure condor pool for PU universe:
# cat /etc/condor/99_pu.config
  CONDOR_HOST=HOST
  ALLOW_WRITE = *
  ALLOW_READ = *
  ALLOW_ADMINISTRATOR = *
  NEGOTIATOR_INTERVAL = 30
  SCHEDD_INTERVAL = 15
  COLLECTOR_UPDATE_INTERVAL = 15
  NUM_CPUS = 1
  DedicatedScheduler="DedicatedScheduler@HOST"
  STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
  START		= True
  SUSPEND	= False
  CONTINUE	= True
  PREEMPT	= False
  KILL		= False
  WANT_SUSPEND	= False
  WANT_VACATE	= False
  RANK		= Scheduler =?= $(DedicatedScheduler)
  CONDOR_SSHD = /usr/sbin/sshd
  CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
  MPI_CONDOR_RSH_PATH = $(LIBEXEC)
  UID_DOMAIN = *
  #DAEMON_LIST = MASTER, STARTD


2. Submit a few simple PU jobs.
$ cat parallel.job
  universe = parallel
  executable = parallel.sh
  args = "60"
  log = /tmp/parallel_job.$(cluster).$(process).log
  output = /tmp/parallel_job.$(cluster).$(process)-$(NODE).out
  error = /tmp/parallel_job.$(cluster).$(process)-$(NODE).err
  machine_count = 4
  should_transfer_files = yes
  when_to_transfer_output = on_exit
  requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
  +ParallelShutdownPolicy = "WAIT_FOR_ALL"
  queue

$ cat parallel.sh
  #!/usr/bin/env bash
  echo "hostname: $(hostname)"
  echo "whoami: $(whoami)"
  echo "id: $(id)"
  sleep 0$1
  echo $@


3. Wait until one job is running and then remove it (and watch condor_q and condor_status):
...
$ condor_submit parallel.job
  Submitting job(s).
  1 job(s) submitted to cluster 4.

$ condor_q && condor_status
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     1.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
  4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended

  Name  OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  HOST1 LINUX      INTEL  Unclaimed Idle     0.100   502  0+00:00:04
  HOST2 LINUX      INTEL  Unclaimed Idle     0.000   499  0+00:00:04
  HOST3 LINUX      X86_64 Unclaimed Idle     0.140   497  0+00:00:04
  HOST4 LINUX      X86_64 Unclaimed Idle     0.060   490  0+00:00:04

                       Machines Owner Claimed Unclaimed Matched Preempting
           INTEL/LINUX        2     0       0         2       0          0
          X86_64/LINUX        2     0       0         2       0          0
                 Total        4     0       0         4       0          0

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     1.0   test            4/4  09:41   0+00:00:17 R  0   0.0  parallel.sh
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh

  4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended

  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"

$ condor_rm 1.0
  Job 1.0 marked for removal

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:16 R  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh

  3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended

  JobId = "1.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"

$ condor_q && condor_status
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:21 I  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh

  3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended

  Name  OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  HOST1 LINUX      INTEL  Unclaimed Idle     0.300   502  0+00:00:04
  HOST2 LINUX      INTEL  Unclaimed Idle     0.050   499  0+00:00:04
  HOST3 LINUX      X86_64 Unclaimed Idle     0.150   497  0+00:00:04
  HOST4 LINUX      X86_64 Unclaimed Idle     0.100   490  0+00:00:04

                       Machines Owner Claimed Unclaimed Matched Preempting

           INTEL/LINUX        2     0       0         2       0          0
          X86_64/LINUX        2     0       0         2       0          0

                 Total        4     0       0         4       0          0

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:38 R  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh

  3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended

  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"


Actual results:
  The next job is marked as RUNNING immediately after the previous one is removed, but it is not actually running, because the slots remain occupied for a while by the removed job.

Expected results:
  A job is not run before all slots are cleaned up after the previous job.

Additional info:
Comment 2 Timothy St. Clair 2012-08-30 15:13:08 EDT
Testing against the condor-7.8 series, the issue does not reproduce.
Comment 7 Timothy St. Clair 2012-09-07 12:52:32 EDT
In tracing the logs on the Q/E repro machine, I see the case is indeed timing based.

The following events occur:

1.) The schedd receives the condor_rm event and sends a message to the shadow.
2.) The shadow notifies the starters.
3.) The starters try to clean up, sending a message to the startd.
4.) The shadow does not hang around waiting for the cleanup.
5.) The startd and starter hang in a holding pattern, waiting for the shadow.
6.) The schedd spins up a new shadow, trying to reuse the claim for the next job.
7.) The new shadow waits around for cleanup of the old job, which never happens.
8.) The shadow gives up.
9.) The claim is released.
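The race above can be sketched as a toy simulation (all names and the `shadow_waits_for_cleanup` switch are hypothetical illustration, not Condor code): when the old shadow exits without waiting, the starters never finish cleanup, so the new job bounces back to idle until the claim is released.

```python
# Toy simulation of the claim-reuse race described above.
# Illustrative only; this is not Condor code.

def run_removal_race(shadow_waits_for_cleanup: bool) -> list[str]:
    """Replay the event sequence; return the observable states of the next job."""
    events = []
    # 1-2. condor_rm: schedd tells the shadow, shadow notifies the starters.
    events.append("job0 removed, starters notified")
    # 3-5. Starters can only finish cleanup if the shadow sticks around.
    starters_cleaned = shadow_waits_for_cleanup  # pre-fix shadow exits early -> False
    # 6. Schedd immediately spins up a new shadow to reuse the claim.
    events.append("job1 marked running (claim reuse)")
    # 7-8. The new shadow can only start job1 if the slots are actually free.
    if starters_cleaned:
        events.append("job1 running")
    else:
        # Cleanup of the old job never completes; the new shadow gives up.
        events.append("job1 back to idle")
        # 9. Claim released; job1 is matched again later and runs correctly.
        events.append("job1 running (after claim release)")
    return events

# Pre-fix behaviour: the shadow does not wait, so job1 bounces back to idle.
print(run_removal_race(shadow_waits_for_cleanup=False)[-2])  # → job1 back to idle
# Post-fix behaviour (cleanup awaited): job1 starts cleanly on first activation.
print(run_removal_race(shadow_waits_for_cleanup=True)[-1])   # → job1 running
```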
Comment 8 Timothy St. Clair 2012-09-07 15:18:57 EDT
Created attachment 610832 [details]
condor logs

Shows failure case in ShadowLog and StarterLog
Comment 12 Timothy St. Clair 2012-09-12 10:25:51 EDT
The fix now waits for the final shutdown call from all starters, which depends on

+ParallelShutdownPolicy = "WAIT_FOR_ALL"

To verify, perform the test mentioned above and check for the following ShadowLog message with D_FULLDEBUG enabled:

"ParallelShadow:​:​shutDown·​WAIT_FOR_ALL ​- ​All ​resources ​have ​called ​exit/​shutdown"
Comment 14 Daniel Horák 2012-11-12 09:59:10 EST
Tested on the new packages condor-7.8.7-0.4.el5.i386 on RHEL 5.9 i386 with 4 machines in the pool, and it doesn't work (the expected log message from comment 12 is present, but the behaviour is the same as described in comment 0):

# rpm -q condor
  condor-7.8.7-0.4.el5.i386

# condor_status
  Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  HOST1 LINUX      INTEL  Claimed   Busy     0.210   502  0+00:00:03
  HOST2 LINUX      INTEL  Claimed   Busy     0.270   502  0+00:00:02
  HOST3 LINUX      INTEL  Claimed   Busy     0.270   502  0+00:00:03
  HOST4 LINUX      INTEL  Claimed   Busy     0.480   502  0+00:00:02
                       Machines Owner Claimed Unclaimed Matched Preempting

           INTEL/LINUX        4     0       4         0       0          0

                 Total        4     0       4         0       0          0

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST1 : <IP:38202> : HOST1
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           11/12 15:44   0+00:00:14 R  0   0.0  parallel.sh       
     2.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     3.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     4.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     5.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       

  5 jobs; 0 completed, 0 removed, 4 idle, 1 running, 0 held, 0 suspended
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"

$ condor_rm 1
  Cluster 1 has been marked for removal.

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST1 : <IP:38202> : HOST1
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     2.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     3.0   test           11/12 15:44   0+00:00:05 R  0   0.0  parallel.sh       
     4.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     5.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       

  4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST1 : <IP:38202> : HOST1
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     2.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     3.0   test           11/12 15:44   0+00:00:21 I  0   0.0  parallel.sh       
     4.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     5.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       

  4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
  State = "Unclaimed"
  State = "Unclaimed"
  State = "Unclaimed"
  State = "Unclaimed"

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST1 : <IP:38202> : HOST1
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     2.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     3.0   test           11/12 15:44   0+00:00:28 R  0   0.0  parallel.sh       
     4.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     5.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       

  4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"


# tail -F ShadowLog | grep ParallelShadow
  11/12/12 15:44:28 (1.0) (21553): In ParallelShadow::startMaster()
  11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 1 resources out of 4 have checked in 
  11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 2 resources out of 4 have checked in 
  11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 3 resources out of 4 have checked in 
  11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - All resources have called exit/shutdown
  11/12/12 15:44:57 (3.0) (21820): In ParallelShadow::startMaster()
  11/12/12 15:45:43 (3.0) (22208): In ParallelShadow::startMaster()

>>> ASSIGNED
Comment 19 Daniel Horák 2012-11-19 09:43:09 EST
Tested and verified on RHEL 5.9/6.4 - i386/x86_64 (output from RHEL 5.9 i386):

# rpm -q condor
  condor-7.8.7-0.5.el5.i386

$ condor_submit parallel.job 
  Submitting job(s).
  1 job(s) submitted to cluster 1.
$ condor_submit parallel.job 
  Submitting job(s).
  1 job(s) submitted to cluster 2.

# condor_q && condor_status
  -- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           11/16 10:24   0+00:00:00 I  0   0.0  parallel.sh       
     2.0   test           11/16 10:24   0+00:00:00 I  0   0.0  parallel.sh       

  2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

  Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  dhcp-27-110.brq.re LINUX      INTEL  Unclaimed Idle     0.280   502  0+00:00:04
  dhcp-27-111.brq.re LINUX      INTEL  Unclaimed Idle     0.300   502  0+00:00:04
  dhcp-27-112.brq.re LINUX      INTEL  Unclaimed Idle     0.410   502  0+00:00:04
  dhcp-27-194.brq.re LINUX      INTEL  Unclaimed Idle     0.740   502  0+00:00:04
                       Machines Owner Claimed Unclaimed Matched Preempting

           INTEL/LINUX        4     0       0         4       0          0

                 Total        4     0       0         4       0          0
# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           11/16 10:24   0+00:00:05 R  0   0.0  parallel.sh       
     2.0   test           11/16 10:24   0+00:00:00 I  0   0.0  parallel.sh       

  2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"

# condor_rm 1
  Cluster 1 has been marked for removal.

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s


  -- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     2.0   test           11/16 10:24   0+00:00:08 R  0   0.0  parallel.sh       

  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
  JobId = "2.0"
  State = "Claimed"
  JobId = "2.0"
  State = "Claimed"
  JobId = "2.0"
  State = "Claimed"
  JobId = "2.0"
  State = "Claimed"


# tail -F ShadowLog | grep ParallelShadow
  11/16/12 10:25:06 (1.0) (29510): In ParallelShadow::startMaster()
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL - All(4) resources have called exit/shutdown
  11/16/12 10:25:20 (2.0) (29802): In ParallelShadow::startMaster()
  11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): entering FileTransfer::HandleCommands
  11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): entering FileTransfer::HandleCommands
  11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): About to decode condor_sysnum
  11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL - All(4) resources have called exit/shutdown


Jobs are launched only after all slots used by the previous job have been cleaned up.

>>> VERIFIED
Comment 21 errata-xmlrpc 2013-03-06 13:43:34 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html
