See bug 595704 comment #25 and onwards.

condor-qmf-7.4.5-0.7.el5

How reproducible:
100%

Steps to Reproduce:
0. Have a clean Condor with a clean pool in the beginning, configured with:
   QMF_UPDATE_INTERVAL = 5
   COLLECTOR_UPDATE_INTERVAL = 5
   (cumin and qpidd also set to update every 5 seconds)
1. Submit a simple job, e.g.

condor_submit << EOF
Executable = /bin/sleep
Universe = vanilla
args = 20m
queue 1
EOF

2. condor_rm -all
3. Go to Cumin -> Grid -> Overview and look at statistics

Actual results:
Idle or Running stays at 1 for 5 minutes, which is the default SCHEDD_INTERVAL.

Expected results:
Schedd should publish the job count after a remove event, just as it already does after other events (e.g. job addition).
I would expect condor_status -submitter/-schedd to exhibit the same behavior.
This is indeed visible from condor_status -schedd/-submitter as well. The Schedd publishes on SCHEDD_INTERVAL, at the end of a negotiation cycle, at a reconfig or on a reschedule request. Until the next publish, the information in the Collector may be stale, as may the information in the QMF object space. It is probably ok to tickle the Schedd to publish an update on remove, but that may have scale implications: the publishing is done as part of a scan of the entire queue. However, the timeout() code has some protections to prevent processing the queue too frequently. Let's turn this into an RFE for tickling the collector update.
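To make the "protections" point concrete, here is a minimal Python sketch of a publish routine guarded by a minimum interval. It is not the Schedd's actual timeout() code: the class name ThrottledPublisher, the tickle() method and the 5-second value are all hypothetical, and the sketch simply drops a too-early request, whereas a real implementation might defer it instead.

```python
class ThrottledPublisher:
    """Hypothetical stand-in for a publish routine guarded by a minimum interval."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds required between two publishes
        self.last_publish = None          # time of the most recent publish, if any
        self.publish_times = []           # record of every publish that happened

    def tickle(self, now):
        """Request a publish at time `now`; return True if one actually happened."""
        if self.last_publish is not None and now - self.last_publish < self.min_interval:
            return False  # too soon after the previous publish: skip the queue scan
        self.last_publish = now
        self.publish_times.append(now)
        return True


publisher = ThrottledPublisher(min_interval=5)
# Three removes in quick succession (t = 0, 1, 2), then one later (t = 7).
results = [publisher.tickle(t) for t in (0, 1, 2, 7)]
print(results)                  # [True, False, False, True]
print(publisher.publish_times)  # [0, 7]
```

Under this model a burst of removes costs at most one queue scan per minimum interval, which is why tickling on remove looks tolerable for a single event type.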
The Schedd also does not send an update when a job completes. This means the number of running jobs may be stale after a job exits.
Additionally, the Schedd does not send an update when a job starts running.
Also, the Schedd does not send an update when holding a job.
There are many paths to a job changing state that do not result in an update to the Collector. Another not listed above is periodic expression evaluation. Even though timeout() protects itself from rapid repeated calls, given an active Schedd, the calls will effectively make SCHEDD_INTERVAL = SCHEDD_MIN_INTERVAL. Instead of tickling timeout() for each such transition, I suggest setting SCHEDD_INTERVAL to a lower value, one that provides an acceptable lag for a deployment. Wild speculation: SCHEDD_INTERVAL for small or medium sized deployments could be easily set to 30 (from 300). For large deployments, a shorter publish interval may impact Schedd throughput.
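The trade-off can be illustrated with a small simulation. This is a rough Python sketch under stated assumptions, not a model of the real Schedd: count_publishes() is a hypothetical helper, time advances in whole seconds, the minimum interval of 5 is an illustrative value, and "active" is taken to mean that some job changes state every second.

```python
def count_publishes(event_times, duration, interval, min_interval, tickle_on_event):
    """Return the publish times over `duration` seconds.

    Hypothetical model: a periodic timer publishes every `interval` seconds,
    and (optionally) every job state change requests an extra publish that is
    honored only if `min_interval` seconds have passed since the last one.
    """
    events = set(event_times)
    publishes = []
    last = 0  # treat t=0 as the moment of the previous publish
    for now in range(1, duration + 1):
        timer_due = now - last >= interval
        tickled = tickle_on_event and now in events and now - last >= min_interval
        if timer_due or tickled:
            publishes.append(now)
            last = now
    return publishes


# An active schedd: some job changes state every second for 5 minutes.
busy = range(1, 301)
timer_only = count_publishes(busy, 300, interval=300, min_interval=5, tickle_on_event=False)
tickled = count_publishes(busy, 300, interval=300, min_interval=5, tickle_on_event=True)
lowered = count_publishes(busy, 300, interval=30, min_interval=5, tickle_on_event=False)
print(len(timer_only), len(tickled), len(lowered))  # 1 60 10
```

In this toy model, tickling on every transition yields a publish every 5 seconds (60 queue scans in 5 minutes), while lowering the interval to 30 yields 10 scans with a worst-case lag of 30 seconds.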