Bug 809732
| Summary: | PU job is run before slots are cleaned from previous (removed) job | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Daniel Horák <dahorak> |
| Component: | condor | Assignee: | Timothy St. Clair <tstclair> |
| Status: | CLOSED ERRATA | QA Contact: | Daniel Horák <dahorak> |
| Severity: | low | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.1.1 | CC: | iboverma, ltoscano, matt, tstclair |
| Target Milestone: | 2.3 | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | condor-7.8.7-0.5 | Doc Type: | Bug Fix |
| Doc Text: | Cause: A parallel universe job whose claim can be reused is removed. Consequence: The job status can be invalid in the case where such a job is removed and its claim is reused. Fix: Update job status reporting. Result: Job status is accurate. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-03-06 18:43:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | condor logs (attachment 610832) | | |
Testing against the condor-7.8 series, it does not reproduce.

Tracing the logs on the Q/E repro machine shows the case is indeed timing based. The following events occur:
1. The schedd receives the condor_rm event and sends a message to the shadow.
2. The shadow notifies the starters.
3. The starters try to clean up, sending a message to the startd.
4. The shadow does not wait around for the cleanup.
5. The startd and starter hang in a holding pattern, waiting for the shadow.
6. The schedd spins up a new shadow, trying to reuse the claim for the next job.
7. The new shadow waits for cleanup of the old job, which never happens.
8. The shadow gives up.
9. The claim is released.

Created attachment 610832 [details]
condor logs
Shows failure case in ShadowLog and StarterLog
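
For convenience, a minimal shell sketch of the manual reproducer (assuming a fresh queue so the first submitted cluster is 1, the parallel.job submit file from the problem description below, and a pool with four dedicated slots; the sleep intervals are illustrative and only meant to land inside the timing window described above):

#!/usr/bin/env bash
# Sketch: submit several parallel-universe jobs, remove the one that starts
# running, then immediately inspect the slots. On an affected build the slots
# still report the removed job's JobId while condor_q already shows the next
# job as running.
for i in 1 2 3 4 5; do
    condor_submit parallel.job
done

sleep 30                       # wait until the first job is actually running
condor_q

condor_rm 1                    # remove the running job (cluster 1, by assumption)
condor_q
condor_status -l | grep -i -e "^State" -e "^JobId"

sleep 30                       # by now the slots should be cleaned and reusable
condor_q
condor_status -l | grep -i -e "^State" -e "^JobId"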
The fix now waits for the final shutdown call from all starters, which depends on +ParallelShutdownPolicy = "WAIT_FOR_ALL".

To verify, perform the test mentioned and check for the following ShadowLog message with D_FULLDEBUG:
"ParallelShadow::shutDown WAIT_FOR_ALL - All resources have called exit/shutdown"

Tested on the new package condor-7.8.7-0.4.el5.i386 on RHEL 5.9 i386 with 4 machines in the pool and it doesn't work (the expected log message from comment 12 is present, but the behaviour is the same as described in comment 0):

# rpm -q condor
condor-7.8.7-0.4.el5.i386

# condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
HOST1 LINUX INTEL Claimed Busy 0.210 502 0+00:00:03
HOST2 LINUX INTEL Claimed Busy 0.270 502 0+00:00:02
HOST3 LINUX INTEL Claimed Busy 0.270 502 0+00:00:03
HOST4 LINUX INTEL Claimed Busy 0.480 502 0+00:00:02
Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX 4 0 4 0 0 0
Total 4 0 4 0 0 0

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: HOST1 : <IP:38202> : HOST1
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 test 11/12 15:44 0+00:00:14 R 0 0.0 parallel.sh
2.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
4.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
5.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
5 jobs; 0 completed, 0 removed, 4 idle, 1 running, 0 held, 0 suspended
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"

$ condor_rm 1
Cluster 1 has been marked for removal.

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: HOST1 : <IP:38202> : HOST1
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 11/12 15:44 0+00:00:05 R 0 0.0 parallel.sh
4.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
5.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended
JobId = "3.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: HOST1 : <IP:38202> : HOST1
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 11/12 15:44 0+00:00:21 I 0 0.0 parallel.sh
4.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
5.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
State = "Unclaimed"
State = "Unclaimed"
State = "Unclaimed"
State = "Unclaimed"

# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: HOST1 : <IP:38202> : HOST1
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 11/12 15:44 0+00:00:28 R 0 0.0 parallel.sh
4.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
5.0 test 11/12 15:44 0+00:00:00 I 0 0.0 parallel.sh
4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended
JobId = "3.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"

# tail -F ShadowLog | grep ParallelShadow
11/12/12 15:44:28 (1.0) (21553): In ParallelShadow::startMaster()
11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 1 resources out of 4 have checked in
11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 2 resources out of 4 have checked in
11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - 3 resources out of 4 have checked in
11/12/12 15:44:54 (1.0) (21553): ParallelShadow::shutDown WAIT_FOR_ALL - All resources have called exit/shutdown
11/12/12 15:44:57 (3.0) (21820): In ParallelShadow::startMaster()
11/12/12 15:45:43 (3.0) (22208): In ParallelShadow::startMaster()

>>> ASSIGNED

Tested and verified on RHEL 5.9/6.4 - i386/x86_64 (output from RHEL 5.9 i386):
# rpm -q condor
condor-7.8.7-0.5.el5.i386
$ condor_submit parallel.job
Submitting job(s).
1 job(s) submitted to cluster 1.
$ condor_submit parallel.job
Submitting job(s).
1 job(s) submitted to cluster 2.
# condor_q && condor_status
-- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 test 11/16 10:24 0+00:00:00 I 0 0.0 parallel.sh
2.0 test 11/16 10:24 0+00:00:00 I 0 0.0 parallel.sh
2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
dhcp-27-110.brq.re LINUX INTEL Unclaimed Idle 0.280 502 0+00:00:04
dhcp-27-111.brq.re LINUX INTEL Unclaimed Idle 0.300 502 0+00:00:04
dhcp-27-112.brq.re LINUX INTEL Unclaimed Idle 0.410 502 0+00:00:04
dhcp-27-194.brq.re LINUX INTEL Unclaimed Idle 0.740 502 0+00:00:04
Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX 4 0 0 4 0 0
Total 4 0 0 4 0 0
# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 test 11/16 10:24 0+00:00:05 R 0 0.0 parallel.sh
2.0 test 11/16 10:24 0+00:00:00 I 0 0.0 parallel.sh
2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
# condor_rm 1
Cluster 1 has been marked for removal.
# condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: dhcp-27-194.brq.redhat.com : <10.34.27.194:60123> : dhcp-27-194.brq.redhat.com
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 test 11/16 10:24 0+00:00:08 R 0 0.0 parallel.sh
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
JobId = "2.0"
State = "Claimed"
JobId = "2.0"
State = "Claimed"
JobId = "2.0"
State = "Claimed"
JobId = "2.0"
State = "Claimed"
# tail -F ShadowLog | grep ParallelShadow
11/16/12 10:25:06 (1.0) (29510): In ParallelShadow::startMaster()
11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:25:16 (1.0) (29510): About to decode condor_sysnum
11/16/12 10:25:16 (1.0) (29510): ParallelShadow::shutDown WAIT_FOR_ALL - All(4) resources have called exit/shutdown
11/16/12 10:25:20 (2.0) (29802): In ParallelShadow::startMaster()
11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): entering FileTransfer::HandleCommands
11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): entering FileTransfer::HandleCommands
11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL Not all resources have FINISHED11/16/12 10:26:21 (2.0) (29802): About to decode condor_sysnum
11/16/12 10:26:21 (2.0) (29802): ParallelShadow::shutDown WAIT_FOR_ALL - All(4) resources have called exit/shutdown
Jobs are now launched correctly, only after all slots used by the previous job have been cleaned up.
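
For reference, one way to script the ShadowLog check from comment 12 (a sketch; it assumes D_FULLDEBUG is enabled for the shadow and uses condor_config_val to locate the configured log directory):

# Check that the shadow waited for all starters before the claim was reused;
# the expected message is the one quoted in comment 12.
grep 'ParallelShadow::shutDown WAIT_FOR_ALL' "$(condor_config_val LOG)/ShadowLog" \
  | grep 'All.*resources have called exit/shutdown'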
>>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html
Description of problem:
Immediately after a running PU job is manually removed, another job is marked as running, but the slots are still occupied by the previous job, so after about 20 seconds the new job is returned to the idle state. After a while the job is run again, this time correctly.

Version-Release number of selected component (if applicable):
$ rpm -qa | grep condor
condor-classads-7.6.5-0.12.el5
condor-7.6.5-0.12.el5

How reproducible:
100%

Steps to Reproduce:
1. Configure the condor pool for the parallel universe:
# cat /etc/condor/99_pu.config
CONDOR_HOST=HOST
ALLOW_WRITE = *
ALLOW_READ = *
ALLOW_ADMINISTRATOR = *
NEGOTIATOR_INTERVAL = 30
SCHEDD_INTERVAL = 15
COLLECTOR_UPDATE_INTERVAL = 15
NUM_CPUS = 1
DedicatedScheduler="DedicatedScheduler@HOST"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
UID_DOMAIN = *
#DAEMON_LIST = MASTER, STARTD

2. Submit a few simple PU jobs:
$ cat parallel.job
universe = parallel
executable = parallel.sh
args = "60"
log = /tmp/parallel_job.$(cluster).$(process).log
output = /tmp/parallel_job.$(cluster).$(process)-$(NODE).out
error = /tmp/parallel_job.$(cluster).$(process)-$(NODE).err
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
queue

$ cat parallel.sh
#!/usr/bin/env bash
echo "hostname: $(hostname)"
echo "whoami: $(whoami)"
echo "id: $(id)"
sleep 0$1
echo $@

3. Wait until one job is running and then remove it (and watch condor_q and condor_status):
...
$ condor_submit parallel.job
Submitting job(s).
1 job(s) submitted to cluster 4.
$ condor_q && condor_status
-- Submitter: HOST : <IP:36111> : HOST
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
2.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
4.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
HOST1 LINUX INTEL Unclaimed Idle 0.100 502 0+00:00:04
HOST2 LINUX INTEL Unclaimed Idle 0.000 499 0+00:00:04
HOST3 LINUX X86_64 Unclaimed Idle 0.140 497 0+00:00:04
HOST4 LINUX X86_64 Unclaimed Idle 0.060 490 0+00:00:04
Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX 2 0 0 2 0 0
X86_64/LINUX 2 0 0 2 0 0
Total 4 0 0 4 0 0

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: HOST : <IP:36111> : HOST
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 test 4/4 09:41 0+00:00:17 R 0 0.0 parallel.sh
2.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
4.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
4 jobs; 0 completed, 0 removed, 3 idle, 1 running, 0 held, 0 suspended
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"

$ condor_rm 1.0
Job 1.0 marked for removal

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: HOST : <IP:36111> : HOST
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 4/4 09:41 0+00:00:16 R 0 0.0 parallel.sh
4.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended
JobId = "1.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"
JobId = "1.0"
State = "Claimed"

$ condor_q && condor_status
-- Submitter: HOST : <IP:36111> : HOST
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 4/4 09:41 0+00:00:21 I 0 0.0 parallel.sh
4.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
HOST1 LINUX INTEL Unclaimed Idle 0.300 502 0+00:00:04
HOST2 LINUX INTEL Unclaimed Idle 0.050 499 0+00:00:04
HOST3 LINUX X86_64 Unclaimed Idle 0.150 497 0+00:00:04
HOST4 LINUX X86_64 Unclaimed Idle 0.100 490 0+00:00:04
Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX 2 0 0 2 0 0
X86_64/LINUX 2 0 0 2 0 0
Total 4 0 0 4 0 0

$ condor_q && condor_status -l | grep -i -e "^State" -e "^jobID" -e "^machine"s
-- Submitter: HOST : <IP:36111> : HOST
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
3.0 test 4/4 09:41 0+00:00:38 R 0 0.0 parallel.sh
4.0 test 4/4 09:41 0+00:00:00 I 0 0.0 parallel.sh
3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended
JobId = "3.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"
JobId = "3.0"
State = "Claimed"

Actual results:
A job is marked as RUNNING immediately after the previous one is removed, but it isn't actually running because the slots are still occupied by the removed job for a while.

Expected results:
A job isn't run before all slots are cleaned up from the previous job.

Additional info:
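One possible way to automate the expected-results check (a minimal sketch, assuming the pool and parallel.job from the steps above; REMOVED_JOB, the polling interval and the iteration count are illustrative, and JobStatus == 2 is the running state):

#!/usr/bin/env bash
# Sketch: after condor_rm, flag the failure case where some other job is
# already reported as running while a slot still advertises the removed
# job's JobId.
REMOVED_JOB="1.0"   # cluster.proc of the job that was removed (illustrative)

for i in $(seq 1 30); do
    # Number of jobs in the running state (JobStatus 2) according to the schedd.
    running=$(condor_q -format "%d\n" JobStatus | grep -c '^2$')
    # Number of slots still claiming to run the removed job.
    stale=$(condor_status -l | grep -c "^JobId = \"$REMOVED_JOB\"")
    if [ "$running" -gt 0 ] && [ "$stale" -gt 0 ]; then
        echo "FAIL: a job is running while $stale slot(s) still hold job $REMOVED_JOB"
        exit 1
    fi
    sleep 2
done
echo "OK: no job started before the slots of job $REMOVED_JOB were cleaned up"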