Description of problem:
I have the Central Manager (CM) and Scheduler on RHEL and an Execute node on Windows. I submit Windows jobs from Windows to the CM. Once they start, I run a script that holds and releases every job 100 times:

@echo off
for /L %%B in (1,1,100) do (
    echo "ITERATION %%B"
    for /L %%A in (1,1,50) do (
        C:\condor\bin\condor_hold.exe -name _hostname_ %%A
    )
    for /L %%A in (1,1,50) do (
        C:\condor\bin\condor_release.exe -name _hostname_ %%A
    )
)

I have "watch condor_q" running on the CM, so condor_q runs every 2 seconds. Condor removes some jobs from the queue during hold and release.

Version-Release number of selected component (if applicable):
Windows, Linux: condor-7.6.5-0.4

How reproducible:
100%

Steps to Reproduce:
1. Submit Windows jobs from Windows to the Linux CM/Sched.
2. Wait until some jobs start to run.
3. Run the hold/release script.

Actual results:
Condor removes some jobs from the queue during hold/release.

Expected results:
Condor does not remove jobs from the queue during hold/release, or merely because other jobs are being held or released.
Created attachment 531561: central manager and scheduler configuration, remote configuration of pool, and removed jobs
Ref: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2577
IIRC we have seen something like this in the past too.
Potentially also related: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2565
Created attachment 531564: logs from Windows execute node
It's RHEL 5.7. Sorry for not mentioning that before.
I've also reproduced this on a pure Linux pool with CM and Sched on RHEL 5.7. I used this loop:

for i in `seq 100`; do
    for j in `seq 1 50`; do
        condor_hold -name mrg-qe-06.lab.eng.brq.redhat.com $j
    done
    sleep 60
    for j in `seq 1 50`; do
        condor_release -name mrg-qe-06.lab.eng.brq.redhat.com $j
    done
done

with these jobs (1-50):

cmd = /bin/sleep
arguments = 1000000
requirements = (FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
queue
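To spot a removed job without watching condor_q by eye, two snapshots of the queue can be compared. The following is only a sketch under assumptions: the helper name missing_jobs and the snapshot file names are made up for illustration, and each snapshot is assumed to contain one job ID per line (for example, written out with condor_q's -format option before and after a hold/release pass).

```shell
#!/bin/sh
# missing_jobs BEFORE AFTER   (hypothetical helper, not part of Condor)
# Prints every job ID that is listed in BEFORE but no longer in AFTER.
# Both files are assumed to hold one job ID per line, e.g. "12.0".
missing_jobs() {
    before_sorted=$(mktemp) || return 1
    after_sorted=$(mktemp) || { rm -f "$before_sorted"; return 1; }

    # comm needs sorted input; -u also drops duplicate lines.
    sort -u "$1" > "$before_sorted"
    sort -u "$2" > "$after_sorted"

    # -23 suppresses lines unique to AFTER and lines common to both,
    # leaving only the IDs that vanished.
    comm -23 "$before_sorted" "$after_sorted"

    rm -f "$before_sorted" "$after_sorted"
}

# Assumed way to take a snapshot (SCHEDD is a placeholder for the schedd name):
#   condor_q -name "$SCHEDD" -format "%d." ClusterId -format "%d\n" ProcId > before.txt
#   ... run one hold/release iteration ...
#   condor_q -name "$SCHEDD" -format "%d." ClusterId -format "%d\n" ProcId > after.txt
#   missing_jobs before.txt after.txt
```

Since the test jobs sleep for 1000000 seconds, none of them should finish on their own during the run, so any ID printed here points to a job that left the queue unexpectedly.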
It is rarer than before, but Condor still removes jobs from the queue just by holding. The first job was removed after ~6 hours of holding and releasing 50 jobs. I use condor-7.6.5-0.6 on RHEL5 and Windows execute nodes.
->ASSIGNED
I've tested this bug for many days with a pure Linux pool and with a mixed pool (Windows EXE, Linux CM), and no job has been removed. I think this bug is so rare that we can close it as VERIFIED. We have a regression test for this, so we will keep watching it.