Bug 751082 - hold/release removes job from queue
Summary: hold/release removes job from queue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: 2.1
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-11-03 13:40 UTC by Martin Kudlej
Modified: 2012-02-08 10:32 UTC (History)

Fixed In Version: condor-7.6.5-0.6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-01-27 19:12:51 UTC
Target Upstream Version:
Embargoed:


Attachments
central manager and scheduler configuration + remote configuration of pool + removed jobs (317.78 KB, application/x-gzip)
2011-11-03 13:43 UTC, Martin Kudlej
logs from windows execute node (759.32 KB, application/x-gzip)
2011-11-03 14:19 UTC, Martin Kudlej


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 708944 | 0 | medium | CLOSED | hold/release removes job from queue | 2021-02-22 00:41:40 UTC

Description Martin Kudlej 2011-11-03 13:40:27 UTC
Description of problem:
I have a Central Manager and Scheduler on RHEL and an Execute node on Windows. I submit Windows jobs from Windows to the CM. Once they start, I run a script that holds and releases every job 100 times.

@echo off
rem Repeat the hold/release cycle 100 times.
for /L %%B in (1,1,100) do (
  echo "ITERATION %%B"

  rem Hold jobs 1 through 50.
  for /L %%A in (1,1,50) do (
    C:\condor\bin\condor_hold.exe -name _hostname_ %%A
  )

  rem Release the same jobs.
  for /L %%A in (1,1,50) do (
    C:\condor\bin\condor_release.exe -name _hostname_ %%A
  )
)

I have "watch condor_q" running on the CM, so condor_q runs every 2 seconds.

Condor removes some jobs from the queue during hold and release.


Version-Release number of selected component (if applicable):
Windows, Linux: condor-7.6.5-0.4

How reproducible:
100%

Steps to Reproduce:
1. submit Windows jobs from Windows to Linux CM/Sched
2. wait till some jobs start to run
3. run hold/release script
  
Actual results:
Condor removes some jobs from the queue during hold/release.

Expected results:
Condor should not remove jobs from the queue during hold/release, or merely because some jobs are being held or released.
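
For anyone reproducing this, the expected result can be checked mechanically instead of by watching condor_q. The following is only a rough sketch, not part of the original test setup: missing_jobs is a hypothetical helper that reads a list of cluster IDs from stdin (one per line) and prints every ID in the expected range that is absent. It assumes the jobs were submitted as clusters 1 through 50, as in the script above.

```shell
#!/bin/sh
# Sketch only: report which expected cluster IDs are absent from a queue
# listing. The listing is read from stdin, one ID per line.
# Usage: missing_jobs FIRST LAST
missing_jobs() {
    first=$1
    last=$2
    # Read the whole listing once so it can be searched for every ID.
    present=$(cat)
    id=$first
    while [ "$id" -le "$last" ]; do
        # grep -x matches whole lines only, so 5 does not match 15 or 50.
        if ! printf '%s\n' "$present" | grep -qx "$id"; then
            echo "$id"
        fi
        id=$((id + 1))
    done
}
```

One possible way to feed it, assuming condor_q's -format option is used to print one ClusterId per line: condor_q -name _hostname_ -format "%d\n" ClusterId | missing_jobs 1 50. Held jobs still appear in the queue, so any ID printed here has left the queue entirely.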

Comment 1 Martin Kudlej 2011-11-03 13:43:48 UTC
Created attachment 531561 [details]
central manager and scheduler configuration + remote configuration of pool + removed jobs

Comment 2 Timothy St. Clair 2011-11-03 14:03:54 UTC
Ref: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2577

iirc we have seen something like this in the past too.

Comment 3 Timothy St. Clair 2011-11-03 14:05:54 UTC
Potentially also related: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2565

Comment 4 Martin Kudlej 2011-11-03 14:19:36 UTC
Created attachment 531564 [details]
logs from windows execute node

Comment 7 Martin Kudlej 2011-11-04 14:09:12 UTC
It's RHEL 5.7. Sorry for not mentioning that earlier.

Comment 10 Martin Kudlej 2011-11-04 16:18:44 UTC
I've also reproduced this on a pure Linux pool with the CM and Sched on RHEL 5.7. I used this loop:
 for i in `seq 100`; do for j in `seq 1 50`; do condor_hold -name mrg-qe-06.lab.eng.brq.redhat.com $j; done; sleep 60; for j in `seq 1 50`; do condor_release -name mrg-qe-06.lab.eng.brq.redhat.com $j; done; done;

with these jobs (1-50):
cmd = /bin/sleep 
arguments = 1000000 
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED) 
queue
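
The one-line loop in this comment is easier to adapt when split into a function. This is a sketch under assumptions, not the script that was actually run: hold_release_pass and the RUNNER variable are hypothetical names introduced here. RUNNER is left empty to execute the commands, or set to echo to print them without touching the pool.

```shell
#!/bin/sh
# Sketch only: one hold/release pass over clusters FIRST..LAST on schedd NAME.
# Usage: hold_release_pass NAME FIRST LAST [PAUSE_SECONDS]
# RUNNER is a hypothetical hook: empty runs the commands, RUNNER=echo prints them.
hold_release_pass() {
    name=$1
    first=$2
    last=$3
    pause=${4:-60}
    for cmd in condor_hold condor_release; do
        id=$first
        while [ "$id" -le "$last" ]; do
            ${RUNNER:-} "$cmd" -name "$name" "$id"
            id=$((id + 1))
        done
        # Pause between the hold phase and the release phase, as the
        # original loop does with "sleep 60".
        if [ "$cmd" = condor_hold ]; then
            ${RUNNER:-} sleep "$pause"
        fi
    done
}
```

The original 100 iterations would then be: for i in `seq 100`; do hold_release_pass mrg-qe-06.lab.eng.brq.redhat.com 1 50; done. Running once with RUNNER=echo first shows exactly which commands will be issued.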

Comment 12 Martin Kudlej 2011-11-10 08:53:03 UTC
It is rarer than before, but Condor still removes jobs from the queue just by holding.
The first job was removed after ~6 hours of holding and releasing 50 jobs.

I use condor-7.6.5-0.6 on RHEL5 and Windows execute nodes.

->ASSIGNED

Comment 23 Martin Kudlej 2011-11-22 07:34:28 UTC
I've tested this for many days with a pure Linux pool and with a mixed pool (Windows execute node, Linux CM), and no job has been removed. I think this bug is now rare enough that we can close it as VERIFIED. We have a regression test for this, so we will keep watching it.

