Bug 751082

Summary: hold/release removes job from queue
Product: Red Hat Enterprise MRG
Reporter: Martin Kudlej <mkudlej>
Component: condor
Assignee: Timothy St. Clair <tstclair>
Status: CLOSED ERRATA
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Priority: high
Version: Development
CC: jneedle, matt, tstclair
Target Milestone: 2.1
Keywords: Regression
Target Release: ---
Hardware: All
OS: All
Fixed In Version: condor-7.6.5-0.6
Doc Type: Bug Fix
Last Closed: 2012-01-27 19:12:51 UTC
Attachments (description; flags):
central manager and scheduler configuration + remote configuration of pool + removed jobs; flags: none
logs from windows execute node; flags: none

Description Martin Kudlej 2011-11-03 13:40:27 UTC
Description of problem:
I have the Central Manager (CM) and Scheduler on RHEL and an execute node on Windows. I submit Windows jobs from Windows to the CM. Once they start, I run a script that holds and then releases every job, 100 times over:

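@echo off
rem Hold, then release, jobs 1-50; repeat the whole cycle 100 times.
rem _hostname_ is a placeholder for the scheduler name.
for /L %%B in (1,1,100) do (
  echo "ITERATION %%B"
  for /L %%A in (1,1,50) do (
    C:\condor\bin\condor_hold.exe -name _hostname_ %%A
  )
  for /L %%A in (1,1,50) do (
    C:\condor\bin\condor_release.exe -name _hostname_ %%A
  )
)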

I have "watch condor_q" running on the CM, so condor_q runs every 2 seconds.

Condor removes some jobs from the queue during the hold/release cycles.
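
Spotting a vanished job by eye in "watch condor_q" output is error-prone. The following is a minimal sketch of an automated check, not the test actually used here. It assumes the jobs are clusters 1-50, as in the script above, and that condor_q is asked to print one cluster ID per line. The helper name missing_jobs is hypothetical and not part of Condor.

```shell
#!/bin/sh
# missing_jobs EXPECTED_COUNT
# Reads the cluster IDs currently in the queue from stdin (one per line)
# and prints every ID in 1..EXPECTED_COUNT that is absent.
missing_jobs() {
  expected_count=$1
  present=$(cat)
  i=1
  while [ "$i" -le "$expected_count" ]; do
    # grep -x matches whole lines only, so "5" does not match "15" or "50".
    if ! printf '%s\n' "$present" | grep -qx "$i"; then
      echo "$i"
    fi
    i=$((i + 1))
  done
}

# Assumed usage on the CM (_hostname_ is a placeholder, as in the script above):
#   condor_q -name _hostname_ -format '%d\n' ClusterId | missing_jobs 50
```

Held jobs still appear in condor_q output, so any ID this prints is a job that has left the queue entirely, not one that is merely on hold.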


Version-Release number of selected component (if applicable):
Windows, Linux: condor-7.6.5-0.4

How reproducible:
100%

Steps to Reproduce:
1. Submit Windows jobs from Windows to the Linux CM/Sched
2. Wait until some jobs start to run
3. Run the hold/release script
  
Actual results:
Condor removes some jobs from the queue during hold/release.

Expected results:
Condor does not remove jobs from the queue during hold/release, or merely because some jobs are being held or released.

Comment 1 Martin Kudlej 2011-11-03 13:43:48 UTC
Created attachment 531561
central manager and scheduler configuration + remote configuration of pool + removed jobs

Comment 2 Timothy St. Clair 2011-11-03 14:03:54 UTC
Ref: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2577

iirc we have seen something like this in the past too.

Comment 3 Timothy St. Clair 2011-11-03 14:05:54 UTC
Potentially also related: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2565

Comment 4 Martin Kudlej 2011-11-03 14:19:36 UTC
Created attachment 531564
logs from windows execute node

Comment 7 Martin Kudlej 2011-11-04 14:09:12 UTC
It's RHEL 5.7. Sorry for not mentioning that before.

Comment 10 Martin Kudlej 2011-11-04 16:18:44 UTC
I've also reproduced this on a pure Linux pool with the CM and Sched on RHEL 5.7. I used this loop:
for i in $(seq 100); do
  for j in $(seq 1 50); do
    condor_hold -name mrg-qe-06.lab.eng.brq.redhat.com $j
  done
  sleep 60
  for j in $(seq 1 50); do
    condor_release -name mrg-qe-06.lab.eng.brq.redhat.com $j
  done
done

with these jobs (1-50):
cmd = /bin/sleep 
arguments = 1000000 
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED) 
queue
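
A bare queue statement creates one job per submission, so getting jobs 1-50 means submitting this description 50 times. The following is a minimal sketch of that setup. It assumes a schedd whose cluster IDs start at 1; the helper name write_submit_file and the file name sleep.sub are hypothetical.

```shell
#!/bin/sh
# write_submit_file PATH
# Writes the submit description from this comment to PATH.
write_submit_file() {
  cat > "$1" <<'EOF'
cmd = /bin/sleep
arguments = 1000000
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue
EOF
}

# Assumed usage: each condor_submit call creates a new cluster with one job.
#   write_submit_file sleep.sub
#   for j in $(seq 1 50); do
#     condor_submit -name mrg-qe-06.lab.eng.brq.redhat.com sleep.sub
#   done
```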

Comment 12 Martin Kudlej 2011-11-10 08:53:03 UTC
It is rarer than before, but Condor still removes jobs from the queue just from being held.
The first job was removed after ~6 hours of holding and releasing 50 jobs.

I use condor-7.6.5-0.6 on RHEL5 and Windows execute nodes.

->ASSIGNED

Comment 23 Martin Kudlej 2011-11-22 07:34:28 UTC
I've tested this bug for many days with a pure Linux pool and with a mixed pool (Windows EXE, Linux CM), and no job has been removed. I think this bug is now rare enough that we can close it as VERIFIED. We have a regression test for this, so we will keep watching it.