Bug 809732 - PU job is run before slots are cleaned from previous (removed) job
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
2.1.1
All Linux
Priority: high  Severity: low
: 2.3
: ---
Assigned To: Timothy St. Clair
Daniel Horák
: Reopened
Depends On:
Blocks:
Reported: 2012-04-04 04:42 EDT by Daniel Horák
Modified: 2013-03-06 13:43 EST (History)
4 users

See Also:
Fixed In Version: condor-7.8.7-0.5
Doc Type: Bug Fix
Doc Text:
Cause: A parallel universe job whose claim can be reused is removed. Consequence: The job status can be invalid when a parallel universe job is removed and its claim is reused. Fix: Job status reporting was updated. Result: The job status is accurate.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-06 13:43:34 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
condor logs (311.88 KB, application/x-gzip)
2012-09-07 15:18 EDT, Timothy St. Clair
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Condor 3223 None None None 2012-09-07 17:14:59 EDT
Red Hat Product Errata RHSA-2013:0564 normal SHIPPED_LIVE Low: Red Hat Enterprise MRG Grid 2.3 security update 2013-03-06 18:37:09 EST

Description Daniel Horák 2012-04-04 04:42:30 EDT
Description of problem:
  Immediately after a running PU job is manually removed, another job is marked as running, but the slots are still occupied by the previous job, so after about 20 seconds the new job is returned to the idle state. After a while the job runs again, this time correctly.


Version-Release number of selected component (if applicable):
$ rpm -qa | grep condor
  condor-classads-7.6.5-0.12.el5
  condor-7.6.5-0.12.el5


How reproducible:
  100%


Steps to Reproduce:
1. Configure condor pool for PU universe:
# cat /etc/condor/99_pu.config
  CONDOR_HOST=HOST
  ALLOW_WRITE = *
  ALLOW_READ = *
  ALLOW_ADMINISTRATOR = *
  NEGOTIATOR_INTERVAL = 30
  SCHEDD_INTERVAL = 15
  COLLECTOR_UPDATE_INTERVAL = 15
  NUM_CPUS = 1
  DedicatedScheduler="DedicatedScheduler@HOST"
  STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
  START		= True
  SUSPEND	= False
  CONTINUE	= True
  PREEMPT	= False
  KILL		= False
  WANT_SUSPEND	= False
  WANT_VACATE	= False
  RANK		= Scheduler =?= $(DedicatedScheduler)
  CONDOR_SSHD = /usr/sbin/sshd
  CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
  MPI_CONDOR_RSH_PATH = $(LIBEXEC)
  UID_DOMAIN = *
  #DAEMON_LIST = MASTER, STARTD


2. Submit a few simple PU jobs.
$ cat parallel.job
  universe = parallel
  executable = parallel.sh
  args = "60"
  log = /tmp/parallel_job.$(cluster).$(process).log
  output = /tmp/parallel_job.$(cluster).$(process)-$(NODE).out
  error = /tmp/parallel_job.$(cluster).$(process)-$(NODE).err
  machine_count = 4
  should_transfer_files = yes
  when_to_transfer_output = on_exit
  requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
  +ParallelShutdownPolicy = "WAIT_FOR_ALL"
  queue

$ cat parallel.sh
  #!/usr/bin/env bash
  echo "hostname: $(hostname)"
  echo "whoami: $(whoami)"
  echo "id: $(id)"
  sleep 0$1
  echo $@


3. Wait until one job is running and then remove it (and watch condor_q and condor_status):
...
$ condor_submit parallel.job
  Submitting job(s).
  1 job(s) submitted to cluster 4.

$ condor_q && condor_status
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     1.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
  4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended

  Name  OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  HOST1 LINUX      INTEL  Unclaimed Idle     0.100   502  0+00:00:04
  HOST2 LINUX      INTEL  Unclaimed Idle     0.000   499  0+00:00:04
  HOST3 LINUX      X86_64 Unclaimed Idle     0.140   497  0+00:00:04
  HOST4 LINUX      X86_64 Unclaimed Idle     0.060   490  0+00:00:04

                       Machines Owner Claimed Unclaimed Matched Preempting
           INTEL/LINUX        2     0       0         2       0          0
          X86_64/LINUX        2     0       0         2       0          0
                 Total        4     0       0         4       0          0

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     1.0   test            4/4  09:41   0+00:00:17 R  0   0.0  parallel.sh
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh

  4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended

  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"

$ condor_rm 1.0
  Job 1.0 marked for removal

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:16 R  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh

  3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended

  JobId = "1.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"

$ condor_q && condor_status
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:21 I  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh

  3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended

  Name  OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  HOST1 LINUX      INTEL  Unclaimed Idle     0.300   502  0+00:00:04
  HOST2 LINUX      INTEL  Unclaimed Idle     0.050   499  0+00:00:04
  HOST3 LINUX      X86_64 Unclaimed Idle     0.150   497  0+00:00:04
  HOST4 LINUX      X86_64 Unclaimed Idle     0.100   490  0+00:00:04

                       Machines Owner Claimed Unclaimed Matched Preempting

           INTEL/LINUX        2     0       0         2       0          0
          X86_64/LINUX        2     0       0         2       0          0

                 Total        4     0       0         4       0          0

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST : <IP:36111> : HOST
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
     2.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh
     3.0   test            4/4  09:41   0+00:00:38 R  0   0.0  parallel.sh
     4.0   test            4/4  09:41   0+00:00:00 I  0   0.0  parallel.sh

  3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended

  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"


Actual results:
  The next job is marked as RUNNING immediately after the previous one is removed, but it is not actually running, because the slots remain occupied for a while by the removed job.

Expected results:
  A job is not run before all slots are cleaned up after the previous job.

Additional info:
Comment 2 Timothy St. Clair 2012-08-30 15:13:08 EDT
Testing against the condor-7.8 series, the issue does not reproduce.
Comment 7 Timothy St. Clair 2012-09-07 12:52:32 EDT
In tracing the logs on the Q/E repro machine, I see the case is indeed timing based.

The following events occur:

1.) The schedd receives the condor_rm event and sends a message to the shadow.
2.) The shadow notifies the starters.
3.) The starters try to clean up, sending a message to the startd.
4.) The shadow does not hang around waiting for the cleanup.
5.) The startd and starter hang in a holding pattern, waiting for the shadow.
6.) The schedd spins up a new shadow, trying to reuse the claim for the next job.
7.) The new shadow waits around for cleanup of the old job, which never happens.
8.) The shadow gives up.
9.) The claim is released.
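The race above can be sketched as a toy simulation (all names and the `shadow_waits_for_cleanup` switch are hypothetical illustration, not Condor code): when the old shadow exits without waiting, the starters never finish cleanup, so the new job bounces back to idle until the claim is released.

```python
# Toy simulation of the claim-reuse race described above.
# Illustrative only; this is not Condor code.

def run_removal_race(shadow_waits_for_cleanup: bool) -> list[str]:
    """Replay the event sequence; return the observable states of the next job."""
    events = []
    # 1-2. condor_rm: schedd tells the shadow, shadow notifies the starters.
    events.append("job0 removed, starters notified")
    # 3-5. Starters can only finish cleanup if the shadow sticks around.
    starters_cleaned = shadow_waits_for_cleanup  # pre-fix shadow exits early -> False
    # 6. Schedd immediately spins up a new shadow to reuse the claim.
    events.append("job1 marked running (claim reuse)")
    # 7-8. The new shadow can only start job1 if the slots are actually free.
    if starters_cleaned:
        events.append("job1 running")
    else:
        # Cleanup of the old job never completes; the new shadow gives up.
        events.append("job1 back to idle")
        # 9. Claim released; job1 is matched again later and runs correctly.
        events.append("job1 running (after claim release)")
    return events

# Pre-fix behaviour: the shadow does not wait, so job1 bounces back to idle.
print(run_removal_race(shadow_waits_for_cleanup=False)[-2])  # → job1 back to idle
# Post-fix behaviour (cleanup awaited): job1 starts cleanly on first activation.
print(run_removal_race(shadow_waits_for_cleanup=True)[-1])   # → job1 running
```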
Comment 8 Timothy St. Clair 2012-09-07 15:18:57 EDT
Created attachment 610832 [details]
condor logs

Shows failure case in ShadowLog and StarterLog
Comment 12 Timothy St. Clair 2012-09-12 10:25:51 EDT
The fix now waits for the final shutdown call from all starters, which depends on

+ParallelShutdownPolicy = "WAIT_FOR_ALL"

To verify, perform the test mentioned above and check for the following ShadowLog message with D_FULLDEBUG enabled:

"ParallelShadow:​:​shutDown·​WAIT_FOR_ALL ​- ​All ​resources ​have ​called ​exit/​shutdown"
Comment 14 Daniel Horák 2012-11-12 09:59:10 EST
Tested on the new packages condor-7.8.7-0.4.el5.i386 on RHEL 5.9 i386 with 4 machines in the pool, and it doesn't work (the expected log message from comment 12 is present, but the behaviour is the same as described in comment 0):

# rpm -q condor
  condor-7.8.7-0.4.el5.i386

# condor_status
  Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  HOST1 LINUX      INTEL  Claimed   Busy     0.210   502  0+00:00:03
  HOST2 LINUX      INTEL  Claimed   Busy     0.270   502  0+00:00:02
  HOST3 LINUX      INTEL  Claimed   Busy     0.270   502  0+00:00:03
  HOST4 LINUX      INTEL  Claimed   Busy     0.480   502  0+00:00:02
                       Machines Owner Claimed Unclaimed Matched Preempting

           INTEL/LINUX        4     0       4         0       0          0

                 Total        4     0       4         0       0          0

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST1 : <IP:38202> : HOST1
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           11/12 15:44   0+00:00:14 R  0   0.0  parallel.sh       
     2.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     3.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     4.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     5.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       

  5 jobs; 0 completed, 0 removed, 4 idle, 1 running, 0 held, 0 suspended
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"

$ condor_rm 1
  Cluster 1 has been marked for removal.

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST1 : <IP:38202> : HOST1
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     2.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     3.0   test           11/12 15:44   0+00:00:05 R  0   0.0  parallel.sh       
     4.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     5.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       

  4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST1 : <IP:38202> : HOST1
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     2.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     3.0   test           11/12 15:44   0+00:00:21 I  0   0.0  parallel.sh       
     4.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     5.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       

  4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
  State = "Unclaimed"
  State = "Unclaimed"
  State = "Unclaimed"
  State = "Unclaimed"

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: HOST1 : <IP:38202> : HOST1
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     2.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     3.0   test           11/12 15:44   0+00:00:28 R  0   0.0  parallel.sh       
     4.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       
     5.0   test           11/12 15:44   0+00:00:00 I  0   0.0  parallel.sh       

  4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"
  JobId = "3.0"
  State = "Claimed"


# tail -F ShadowLog | grep ParallelShadow
  11/12/12 15:44:28 (1.0) (21553): In ParallelShadow::startMaster()
  11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 1 resources out of 4 have checked in 
  11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 2 resources out of 4 have checked in 
  11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 3 resources out of 4 have checked in 
  11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - All resources have called exit/shutdown
  11/12/12 15:44:57 (3.0) (21820): In ParallelShadow::startMaster()
  11/12/12 15:45:43 (3.0) (22208): In ParallelShadow::startMaster()

>>> ASSIGNED
Comment 19 Daniel Horák 2012-11-19 09:43:09 EST
Tested and verified on RHEL 5.9/6.4 - i386/x86_64 (output from RHEL 5.9 i386):

# rpm -q condor
  condor-7.8.7-0.5.el5.i386

$ condor_submit parallel.job 
  Submitting job(s).
  1 job(s) submitted to cluster 1.
$ condor_submit parallel.job 
  Submitting job(s).
  1 job(s) submitted to cluster 2.

# condor_q && condor_status
  -- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           11/16 10:24   0+00:00:00 I  0   0.0  parallel.sh       
     2.0   test           11/16 10:24   0+00:00:00 I  0   0.0  parallel.sh       

  2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

  Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  dhcp-27-110.brq.re LINUX      INTEL  Unclaimed Idle     0.280   502  0+00:00:04
  dhcp-27-111.brq.re LINUX      INTEL  Unclaimed Idle     0.300   502  0+00:00:04
  dhcp-27-112.brq.re LINUX      INTEL  Unclaimed Idle     0.410   502  0+00:00:04
  dhcp-27-194.brq.re LINUX      INTEL  Unclaimed Idle     0.740   502  0+00:00:04
                       Machines Owner Claimed Unclaimed Matched Preempting

           INTEL/LINUX        4     0       0         4       0          0

                 Total        4     0       0         4       0          0
# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
  -- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test           11/16 10:24   0+00:00:05 R  0   0.0  parallel.sh       
     2.0   test           11/16 10:24   0+00:00:00 I  0   0.0  parallel.sh       

  2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"
  JobId = "1.0"
  State = "Claimed"

# condor_rm 1
  Cluster 1 has been marked for removal.

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s


  -- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     2.0   test           11/16 10:24   0+00:00:08 R  0   0.0  parallel.sh       

  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
  JobId = "2.0"
  State = "Claimed"
  JobId = "2.0"
  State = "Claimed"
  JobId = "2.0"
  State = "Claimed"
  JobId = "2.0"
  State = "Claimed"


# tail -F ShadowLog | grep ParallelShadow
  11/16/12 10:25:06 (1.0) (29510): In ParallelShadow::startMaster()
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
  11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL - All(4) resources have called exit/shutdown
  11/16/12 10:25:20 (2.0) (29802): In ParallelShadow::startMaster()
  11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): entering FileTransfer::HandleCommands
  11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): entering FileTransfer::HandleCommands
  11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): About to decode condor_sysnum
  11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL - All(4) resources have called exit/shutdown


Jobs are launched only after all slots used by the previous job have been cleaned up.

>>> VERIFIED
Comment 21 errata-xmlrpc 2013-03-06 13:43:34 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html
