Bug 708944 - hold/release removes job from queue
Summary: hold/release removes job from queue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.0.1
Assignee: Timothy St. Clair
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On:
Blocks: 723887
 
Reported: 2011-05-30 09:08 UTC by Martin Kudlej
Modified: 2018-11-14 11:58 UTC (History)
4 users

Fixed In Version: condor-7.6.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: Quick hold+release of running jobs. Consequence: Jobs would accidentally be removed from the queue. Fix: Detect the use case to prevent accidental removal. Result: Jobs remain in the queue.
Clone Of:
Environment:
Last Closed: 2011-09-07 16:44:04 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs and configuration from the host where jobs were scheduled (1.28 MB, application/x-gzip)
2011-05-30 09:08 UTC, Martin Kudlej
logs and configuration from the host where jobs were submitted (2.59 MB, application/x-gzip)
2011-05-30 09:09 UTC, Martin Kudlej


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1249 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update 2011-09-07 16:40:45 UTC

Internal Links: 751082

Description Martin Kudlej 2011-05-30 09:08:26 UTC
Created attachment 501749 [details]
logs and configuration from the host where jobs were scheduled

Description of problem:
I've executed a test which holds/releases 1000 long-term jobs. After ~150 hold/release cycles, condor removes one job from the queue. This does not depend on the method (aviary/command API) used for the release/hold.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.8

How reproducible:
100%

Steps to Reproduce:
1. install condor
2. submit 1000 long-term jobs (for example "sleep 100000")
3. i = 1000
4. hold jobs and wait till all jobs are held
5. release jobs and wait till all jobs are released
6. i--
7. if i > 0 go to 4.
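The numbered steps above can be sketched as a shell function (a minimal sketch; the helper names and the 5-second polling interval are our own choices, not from the bug report):

```shell
#!/bin/sh
# Sketch of reproduce steps 3-7; call run_cycles after submitting
# the 1000 "sleep 100000" jobs (steps 1-2).

held_count() {
    # JobStatus == 5 means "held" in the condor job ClassAd.
    condor_q -constraint 'JobStatus == 5' -format "%d\n" ClusterId | wc -l
}

run_cycles() {
    i=1000                                                  # step 3
    while [ "$i" -gt 0 ]; do                                # step 7
        condor_hold -a                                      # step 4...
        while [ "$(held_count)" -lt 1000 ]; do sleep 5; done  # ...wait until all held
        condor_release -a                                   # step 5...
        while [ "$(held_count)" -gt 0 ]; do sleep 5; done     # ...wait until all released
        i=$((i-1))                                          # step 6
    done
}
```

After each release, checking `condor_q | tail -n1` for the total job count shows whether a job has disappeared from the queue.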
  
Actual results:
Condor removes one job from queue.

Expected results:
Condor will not remove any job from queue because of hold/release.

Additional info:
$ condor_history -l _removed_job_

Out = "/tmp/mrg_1.1.outsLgbv"
LastPublicClaimId = "<scheduler_host_ip:51056>#1306487254#1865#..."
LastRemoteHost = "slot10@scheduler_host"
LastJobStatus = 1
JobCurrentStartDate = 1306515444
ImageSize_RAW = 1988
Submission = "/bin/sleep 1000000"
ImageSize = 2000
Cmd = "/bin/sleep"
PeriodicRemove = false
Iwd = "/tmp"
LastReleaseReason = " "
PeriodicHold = false
NumJobMatches = 93
JobStatus = 3
EnteredCurrentStatus = 1306515746
ClusterId = 16303
ReleaseReason = " "
JobFinishedHookDone = 1306515746
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
StartdPrincipal = "unauthenticated@unmapped/10.34.33.58"
BytesSent = 0.0
PeriodicRelease = false
REQUIREMENTS = true
MachineAttrSlotWeight0 = 1
ShouldTransferFiles = "YES"
GlobalJobId = "scheduler_host#16303.0#1306497534"
LastRejMatchReason = "no match found"
DiskUsage = 35
WhenToTransferOutput = "ON_EXIT"
UserLog = "/tmp/mrg_1.1.logXkpvJ"
MaxHosts = 1
JobStartDate = 1306497634
LastJobLeaseRenewal = 1306515746
ProcId = 0
Err = "/tmp/mrg_1.1.errB0R9h"
OrigMaxHosts = 1
CurrentHosts = 0
BytesRecvd = 2394936.000000
LastHoldReason = " "
DiskUsage_RAW = 33
RemoteSysCpu = 0.0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,LastPeriodicCheckpoint,RequestCpus,RequestDisk,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 3
TargetType = "Machine"
QDate = 1306497534
LastMatchTime = 1306515444
OnExitHold = false
ResidentSetSize = 0
JobPrio = 0
RemoteWallClockTime = 500.000000
Args = "1000000"
NumJobStarts = 93
NumShadowStarts = 93
CumulativeSlotTime = 500.000000
User = "condor@submit_host"
CurrentTime = time()
MachineAttrCpus0 = 1
JobRunCount = 93
LastVacateTime = 1306515746
LastRejMatchTime = 1306515069
MyType = "Job"
LastSuspensionTime = 0
JobLastStartDate = 1306515257
Owner = "condor"

Comment 1 Martin Kudlej 2011-05-30 09:09:20 UTC
Created attachment 501750 [details]
logs and configuration from the host where jobs were submitted

Comment 2 Matthew Farrellee 2011-05-31 13:55:21 UTC
Please attach reproducer script.

$ i=0; while [ true ]; do echo $i; i=$((i+1)); condor_hold -a; condor_release -a; date; condor_q| tail -n1; done
0
All jobs held.
All jobs released.
Tue May 31 08:53:29 EDT 2011
1001 jobs; 1001 idle, 0 running, 0 held
...
2256
All jobs held.
All jobs released.
Tue May 31 09:37:13 EDT 2011
1001 jobs; 1001 idle, 0 running, 0 held

Also ran with sleeps between hold & release, showing state transitions, but no removed jobs.

Comment 3 Matthew Farrellee 2011-05-31 14:57:25 UTC
$ i=0; while [ true ]; do date; echo $i; i=$((i+1)); condor_hold -a; sleep 1; condor_q | tail -n1; condor_release -a; sleep 1; condor_q| tail -n1; done
...
Tue May 31 10:55:06 EDT 2011
1106
All jobs held.
1000 jobs; 0 idle, 0 running, 1000 held
All jobs released.
1000 jobs; 1000 idle, 0 running, 0 held

Comment 4 Matthew Farrellee 2011-06-17 15:47:23 UTC
The missing step in the reproducer (thanks Tim) is to let jobs start running before the hold + (immediate) release.

Upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2249
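Putting comment 4's observation together with the loop from comment 2, a reproducer that actually triggers the bug would wait for jobs to start running before each hold (a sketch; the function names and cycle bound are our own assumptions):

```shell
#!/bin/sh
# Reproducer sketch incorporating the missing step from comment 4:
# jobs must actually be *running* before the hold + immediate release.

count_status() {
    # JobStatus codes in the job ClassAd: 1 = idle, 2 = running, 5 = held.
    condor_q -constraint "JobStatus == $1" -format "%d\n" ClusterId | wc -l
}

repro() {
    i=0
    while [ "$i" -lt 1000 ]; do
        # The step the comment-2 loop was missing: wait until at least
        # one job has started running before holding.
        while [ "$(count_status 2)" -eq 0 ]; do sleep 5; done
        condor_hold -a       # hold, then release immediately
        condor_release -a
        i=$((i+1))
        echo "cycle $i: $(condor_q | tail -n1)"
    done
}
```

With the pre-hold wait in place, the queue total reported by `condor_q` should eventually drop by one on an unfixed condor-7.6.1, matching the original report.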

Comment 6 Timothy St. Clair 2011-06-23 13:02:59 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Quick hold+release of running jobs.
Consequence: Jobs would accidentally be removed from the queue.
Fix: Detect the use case to prevent accidental removal.
Result: Jobs remain in the queue.

Comment 8 Martin Kudlej 2011-07-22 13:30:01 UTC
Tested on RHEL 5.7/6.1 on x86_64 with condor-7.6.3-0.2 and it works. --> VERIFIED

Comment 9 errata-xmlrpc 2011-09-07 16:44:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html

