Created attachment 501749 [details]
logs and configuration from the host where jobs were scheduled

Description of problem:
I've executed a test which repeatedly holds and releases 1000 long-term jobs. After roughly 150 hold/release cycles, Condor removes one job from the queue. The behavior does not depend on the method (Aviary or command-line API) used for hold/release.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.8

How reproducible:
100%

Steps to Reproduce:
1. install condor
2. submit 1000 long-term jobs (for example "sleep 100000")
3. i = 1000
4. hold jobs and wait until all jobs are held
5. release jobs and wait until all jobs are released
6. i--
7. if i > 0, go to step 4

Actual results:
Condor removes one job from the queue.

Expected results:
Condor does not remove any job from the queue as a result of hold/release.

Additional info:
$ condor_history -l _removed_job_
Out = "/tmp/mrg_1.1.outsLgbv"
LastPublicClaimId = "<scheduler_host_ip:51056>#1306487254#1865#..."
LastRemoteHost = "slot10@scheduler_host"
LastJobStatus = 1
JobCurrentStartDate = 1306515444
ImageSize_RAW = 1988
Submission = "/bin/sleep 1000000"
ImageSize = 2000
Cmd = "/bin/sleep"
PeriodicRemove = false
Iwd = "/tmp"
LastReleaseReason = " "
PeriodicHold = false
NumJobMatches = 93
JobStatus = 3
EnteredCurrentStatus = 1306515746
ClusterId = 16303
ReleaseReason = " "
JobFinishedHookDone = 1306515746
RemoteUserCpu = 0.0
MinHosts = 1
JobUniverse = 5
StartdPrincipal = "unauthenticated@unmapped/10.34.33.58"
BytesSent = 0.0
PeriodicRelease = false
REQUIREMENTS = true
MachineAttrSlotWeight0 = 1
ShouldTransferFiles = "YES"
GlobalJobId = "scheduler_host#16303.0#1306497534"
LastRejMatchReason = "no match found"
DiskUsage = 35
WhenToTransferOutput = "ON_EXIT"
UserLog = "/tmp/mrg_1.1.logXkpvJ"
MaxHosts = 1
JobStartDate = 1306497634
LastJobLeaseRenewal = 1306515746
ProcId = 0
Err = "/tmp/mrg_1.1.errB0R9h"
OrigMaxHosts = 1
CurrentHosts = 0
BytesRecvd = 2394936.000000
LastHoldReason = " "
DiskUsage_RAW = 33
RemoteSysCpu = 0.0
OnExitRemove = true
AutoClusterAttrs = "ImageSize,JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,LastPeriodicCheckpoint,RequestCpus,RequestDisk,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
AutoClusterId = 3
TargetType = "Machine"
QDate = 1306497534
LastMatchTime = 1306515444
OnExitHold = false
ResidentSetSize = 0
JobPrio = 0
RemoteWallClockTime = 500.000000
Args = "1000000"
NumJobStarts = 93
NumShadowStarts = 93
CumulativeSlotTime = 500.000000
User = "condor@submit_host"
CurrentTime = time()
MachineAttrCpus0 = 1
JobRunCount = 93
LastVacateTime = 1306515746
LastRejMatchTime = 1306515069
MyType = "Job"
LastSuspensionTime = 0
JobLastStartDate = 1306515257
Owner = "condor"
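For reference, a minimal sketch of a submit description file matching step 2 and the job ClassAd above; the file name and the fixed log/output/error paths are illustrative (the actual test used randomized /tmp suffixes):

# sleep.sub -- hypothetical submit description for the long-term jobs
universe                = vanilla
executable              = /bin/sleep
arguments               = 1000000
initialdir              = /tmp
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
log                     = /tmp/mrg_1.1.log
output                  = /tmp/mrg_1.1.out
error                   = /tmp/mrg_1.1.err
queue 1000

$ condor_submit sleep.sub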
Created attachment 501750 [details]
logs and configuration from the host where jobs were submitted
Please attach the reproducer script.

$ i=0; while [ true ]; do echo $i; i=$((i+1)); condor_hold -a; condor_release -a; date; condor_q | tail -n1; done
0
All jobs held.
All jobs released.
Tue May 31 08:53:29 EDT 2011
1001 jobs; 1001 idle, 0 running, 0 held
...
2256
All jobs held.
All jobs released.
Tue May 31 09:37:13 EDT 2011
1001 jobs; 1001 idle, 0 running, 0 held

Also ran with sleeps between hold & release, showing the state transitions, but still no removed jobs.
$ i=0; while [ true ]; do date; echo $i; i=$((i+1)); condor_hold -a; sleep 1; condor_q | tail -n1; condor_release -a; sleep 1; condor_q | tail -n1; done
...
Tue May 31 10:55:06 EDT 2011
1106
All jobs held.
1000 jobs; 0 idle, 0 running, 1000 held
All jobs released.
1000 jobs; 1000 idle, 0 running, 0 held
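To flag the disappearance without eyeballing thousands of summary lines, a small check could be appended inside the loop; parsing the first field of condor_q's summary line is an assumption about its exact format in this version:

$ EXPECTED=1000   # assumed initial queue size
$ TOTAL=$(condor_q | tail -n1 | awk '{print $1}')
$ [ "$TOTAL" -lt "$EXPECTED" ] && echo "job removed: $TOTAL of $EXPECTED remain"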
The missing step in the reproducer (thanks Tim) is to let the jobs start running before the hold + (immediate) release.

Upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2249
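A sketch of the loop with that step added; waiting on a nonzero running count parsed from the condor_q summary line is an assumption about the line's format:

$ i=0; while [ true ]; do
    # assumption: block until at least one job reports as running
    while ! condor_q | tail -n1 | grep -qE '[1-9][0-9]* running'; do sleep 5; done
    echo $i; i=$((i+1)); condor_hold -a; condor_release -a; date; condor_q | tail -n1
  done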
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: Quick hold+release of running jobs
C: Jobs would accidentally be removed from the queue
F: Detect the use case to prevent accidental removal
R: Jobs will remain in the queue.
Tested on RHEL 5.7/6.1 x86_64 with condor-7.6.3-0.2 and it works. --> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html