751082 – hold/release removes job from queue

Bug 751082 - hold/release removes job from queue

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	Development
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	2.1
Target Release:	---
Assignee:	Timothy St. Clair
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-11-03 13:40 UTC by Martin Kudlej
Modified:	2012-02-08 10:32 UTC
CC List:	3 users
Fixed In Version:	condor-7.6.5-0.6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-01-27 19:12:51 UTC
Target Upstream Version:
Embargoed:

Attachments
central manager and scheduler configuration + remote configuration of pool + removed jobs (317.78 KB, application/x-gzip) 2011-11-03 13:43 UTC, Martin Kudlej
logs from windows execute node (759.32 KB, application/x-gzip) 2011-11-03 14:19 UTC, Martin Kudlej

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	708944	0	medium	CLOSED	hold/release removes job from queue	2021-02-22 00:41:40 UTC

Description Martin Kudlej 2011-11-03 13:40:27 UTC

Description of problem:
I have a Central Manager (CM) and Scheduler on RHEL and an execute node on Windows. I submit Windows jobs from Windows to the CM. Once they start, I run a script that holds and releases every job 100 times.

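@echo off
rem Repeat the hold/release cycle 100 times over jobs 1-50.
for /L %%B in (1,1,100) do (
  echo "ITERATION %%B"

  rem Put every job on hold.
  for /L %%A in (1,1,50) do (
    C:\condor\bin\condor_hold.exe -name _hostname_ %%A
  )

  rem Release every job again.
  for /L %%A in (1,1,50) do (
    C:\condor\bin\condor_release.exe -name _hostname_ %%A
  )
)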

I have "watch condor_q" running on the CM, so condor_q runs every 2 seconds.

Condor removes some jobs from the queue during hold and release.
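
Watching condor_q by eye makes it easy to miss the moment a job disappears. A minimal sketch of an automated check follows, assuming the jobs occupy cluster IDs 1-50 as in the script above. The helper name missing_jobs is hypothetical and not part of Condor; it only compares a list of expected IDs against a queue snapshot read from standard input.

```shell
#!/bin/sh
# Hypothetical helper (not a Condor tool): print every expected job ID
# that is absent from a queue snapshot read on stdin.
# Assumption: the snapshot holds one cluster ID per line.
missing_jobs() {
    snapshot=$(cat)
    for id in "$@"; do
        # -x matches the whole line, so ID 1 does not match a line "10".
        printf '%s\n' "$snapshot" | grep -qxF -- "$id" || printf '%s\n' "$id"
    done
}
```

One possible use, assuming the same _hostname_ placeholder as the script: condor_q -name _hostname_ -format '%d\n' ClusterId | missing_jobs $(seq 1 50). Any ID printed is a job that is no longer in the queue.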


Version-Release number of selected component (if applicable):
Windows, Linux: condor-7.6.5-0.4

How reproducible:
100%

Steps to Reproduce:
1. Submit Windows jobs from Windows to the Linux CM/Sched
2. Wait until some jobs start to run
3. Run the hold/release script
  
Actual results:
Condor removes some jobs from the queue during hold/release.

Expected results:
Condor should not remove jobs from the queue during hold/release, or merely because some jobs are being held or released.

Comment 1 Martin Kudlej 2011-11-03 13:43:48 UTC

Created attachment 531561
central manager and scheduler configuration + remote configuration of pool + removed jobs

Comment 2 Timothy St. Clair 2011-11-03 14:03:54 UTC

Ref: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2577

iirc we have seen something like this in the past too.

Comment 3 Timothy St. Clair 2011-11-03 14:05:54 UTC

Potentially also related: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2565

Comment 4 Martin Kudlej 2011-11-03 14:19:36 UTC

Created attachment 531564
logs from windows execute node

Comment 7 Martin Kudlej 2011-11-04 14:09:12 UTC

It's RHEL 5.7. Sorry for not mentioning that before.

Comment 10 Martin Kudlej 2011-11-04 16:18:44 UTC

I've also reproduced this on a pure Linux pool with the CM and Sched on RHEL 5.7. I used this loop:
 for i in `seq 100`; do for j in `seq 1 50`; do condor_hold -name mrg-qe-06.lab.eng.brq.redhat.com $j; done; sleep 60; for j in `seq 1 50`; do condor_release -name mrg-qe-06.lab.eng.brq.redhat.com $j; done; done;

with these jobs (1-50):
cmd = /bin/sleep 
arguments = 1000000 
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED) 
queue
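
For readability, one iteration of the loop above can be unrolled into a function. This is only a sketch under assumptions: the function name hold_release_cycle and the RUN dry-run switch are hypothetical additions, while the condor_hold and condor_release invocations mirror the one-liner.

```shell
#!/bin/sh
# Hypothetical wrapper around one hold/release iteration of the loop.
# Arguments: scheduler name, number of jobs, pause in seconds.
# RUN is an assumed dry-run switch: set RUN=echo to print the commands
# instead of executing them; leave it unset to run them for real.
hold_release_cycle() {
    schedd=$1
    jobs=$2
    pause=$3
    # Hold jobs 1..N on the named scheduler.
    for j in $(seq 1 "$jobs"); do
        ${RUN:-} condor_hold -name "$schedd" "$j"
    done
    sleep "$pause"
    # Release the same jobs again.
    for j in $(seq 1 "$jobs"); do
        ${RUN:-} condor_release -name "$schedd" "$j"
    done
}
```

Calling hold_release_cycle mrg-qe-06.lab.eng.brq.redhat.com 50 60 inside a 100-iteration loop should correspond to the original one-liner; with RUN=echo it only prints the commands, which is handy for checking the job range before touching the queue.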

Comment 12 Martin Kudlej 2011-11-10 08:53:03 UTC

It is rarer than before, but condor still removes jobs from the queue just by holding them.
The first job was removed after about 6 hours of holding and releasing 50 jobs.

I use condor-7.6.5-0.6 on RHEL5 and Windows execute nodes.

->ASSIGNED

Comment 23 Martin Kudlej 2011-11-22 07:34:28 UTC

I've tested this bug for many days with a pure Linux pool and with a mixed pool (Windows execute node, Linux CM), and no job has been removed. I think this bug is rare enough that we can close it as VERIFIED. We have a regression test for this, so we will keep watching it.
