Created attachment 501749 [details]
logs and configuration from the host where jobs were scheduled

Description of problem:
I've executed a test which repeatedly holds and releases 1000 long-term jobs. After roughly 150 hold/release cycles, Condor removes one job from the queue. The behavior does not depend on the method (Aviary or command-line API) used for hold/release.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.8

How reproducible:
100%

Steps to Reproduce:
1. install condor
2. submit 1000 long-term jobs (for example "sleep 100000")
3. i = 1000
4. hold jobs and wait until all jobs are held
5. release jobs and wait until all jobs are released
6. i--
7. if i > 0, go to step 4

Actual results:
Condor removes one job from the queue.

Expected results:
Condor does not remove any job from the queue as a result of hold/release.

Additional info:
$ condor_history -l _removed_job_
Out = "/tmp/mrg_1.1.outsLgbv"
LastPublicClaimId = "<scheduler_host_ip:51056>#1306487254#1865#..."
LastRemoteHost = "slot10@scheduler_host"
LastJobStatus = 1
JobCurrentStartDate = 1306515444
ImageSize_RAW = 1988
Submission = "/bin/sleep 1000000"
ImageSize = 2000
Cmd = "/bin/sleep"
PeriodicRemove = false
Iwd = "/tmp"
LastReleaseReason = " "
PeriodicHold = false
NumJobMatches = 93
JobStatus = 3
EnteredCurrentStatus = 1306515746
ClusterId = 16303
ReleaseReason = " "
JobFinishedHookDone = 1306515746
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
StartdPrincipal = "unauthenticated@unmapped/10.34.33.58"
BytesSent = 0.0
PeriodicRelease = false
REQUIREMENTS = true
MachineAttrSlotWeight0 = 1
ShouldTransferFiles = "YES"
GlobalJobId = "scheduler_host#16303.0#1306497534"
LastRejMatchReason = "no match found"
DiskUsage = 35
WhenToTransferOutput = "ON_EXIT"
UserLog = "/tmp/mrg_1.1.logXkpvJ"
MaxHosts = 1
JobStartDate = 1306497634
LastJobLeaseRenewal = 1306515746
ProcId = 0
Err = "/tmp/mrg_1.1.errB0R9h"
OrigMaxHosts = 1
CurrentHosts = 0
BytesRecvd = 2394936.000000
LastHoldReason = " "
DiskUsage_RAW = 33
RemoteSysCpu = 0.0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,LastPeriodicCheckpoint,RequestCpus,RequestDisk,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 3
TargetType = "Machine"
QDate = 1306497534
LastMatchTime = 1306515444
OnExitHold = false
ResidentSetSize = 0
JobPrio = 0
RemoteWallClockTime = 500.000000
Args = "1000000"
NumJobStarts = 93
NumShadowStarts = 93
CumulativeSlotTime = 500.000000
User = "condor@submit_host"
CurrentTime = time()
MachineAttrCpus0 = 1
JobRunCount = 93
LastVacateTime = 1306515746
LastRejMatchTime = 1306515069
MyType = "Job"
LastSuspensionTime = 0
JobLastStartDate = 1306515257
Owner = "condor"
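For reference, a minimal sketch of a submit description file matching step 2 and the job ClassAd above; the file name and the fixed log/output/error paths are illustrative (the actual test used randomized /tmp suffixes):

# sleep.sub -- hypothetical submit description for the long-term jobs
universe                = vanilla
executable              = /bin/sleep
arguments               = 1000000
initialdir              = /tmp
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
log                     = /tmp/mrg_1.1.log
output                  = /tmp/mrg_1.1.out
error                   = /tmp/mrg_1.1.err
queue 1000

$ condor_submit sleep.sub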
Created attachment 501750 [details]
logs and configuration from the host where jobs were submitted
Please attach the reproducer script.

$ i=0; while [ true ]; do echo $i; i=$((i+1)); condor_hold -a; condor_release -a; date; condor_q | tail -n1; done
0
All jobs held.
All jobs released.
Tue May 31 08:53:29 EDT 2011
1001 jobs; 1001 idle, 0 running, 0 held
...
2256
All jobs held.
All jobs released.
Tue May 31 09:37:13 EDT 2011
1001 jobs; 1001 idle, 0 running, 0 held

Also ran with sleeps between hold & release, showing the state transitions, but still no removed jobs.
$ i=0; while [ true ]; do date; echo $i; i=$((i+1)); condor_hold -a; sleep 1; condor_q | tail -n1; condor_release -a; sleep 1; condor_q | tail -n1; done
...
Tue May 31 10:55:06 EDT 2011
1106
All jobs held.
1000 jobs; 0 idle, 0 running, 1000 held
All jobs released.
1000 jobs; 1000 idle, 0 running, 0 held
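To flag the disappearance without eyeballing thousands of summary lines, a small check could be appended inside the loop; parsing the first field of condor_q's summary line is an assumption about its exact format in this version:

$ EXPECTED=1000   # assumed initial queue size
$ TOTAL=$(condor_q | tail -n1 | awk '{print $1}')
$ [ "$TOTAL" -lt "$EXPECTED" ] && echo "job removed: $TOTAL of $EXPECTED remain"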
The missing step in the reproducer (thanks Tim) is to let the jobs start running before the hold + (immediate) release.

Upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2249
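A sketch of the loop with that step added; waiting on a nonzero running count parsed from the condor_q summary line is an assumption about the line's format:

$ i=0; while [ true ]; do
    # assumption: block until at least one job reports as running
    while ! condor_q | tail -n1 | grep -qE '[1-9][0-9]* running'; do sleep 5; done
    echo $i; i=$((i+1)); condor_hold -a; condor_release -a; date; condor_q | tail -n1
  done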
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: Quick hold+release of running jobs
C: Jobs would accidentally be removed from the queue
F: Detect the use case to prevent accidental removal
R: Jobs will remain in the queue.
Tested on RHEL 5.7/6.1 x86_64 with condor-7.6.3-0.2 and it works. --> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html