Bug 673179

Summary: RFE: Make Schedd send updates on job remove
Product: Red Hat Enterprise MRG Reporter: Jan Sarenik <jsarenik>
Component: condor Assignee: Matthew Farrellee <matt>
Status: CLOSED NOTABUG QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 1.3 CC: eerlands, iboverma, ltoscano, matt, tmckay
Target Milestone: 2.0 Keywords: FutureFeature
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-24 12:44:13 UTC Type: ---
Bug Depends On: 634302    
Bug Blocks:    

Description Jan Sarenik 2011-01-27 16:44:14 UTC
See bug 595704 comment #25 and onwards.

condor-qmf-7.4.5-0.7.el5

How reproducible: 100%

Steps to Reproduce:
0. Start with a clean Condor installation and a clean pool.
   QMF_UPDATE_INTERVAL = 5
   COLLECTOR_UPDATE_INTERVAL = 5
   (cumin and qpidd also set to update every 5 seconds)

1. Submit a simple job, e.g.
        condor_submit << EOF
        Executable     = /bin/sleep
        Universe = vanilla
        args    = 20m
        queue 1
        EOF
2. condor_rm -all
3. Go to Cumin -> Grid -> Overview and look at statistics
  
Actual results: Idle or Running stays at 1 for 5 minutes,
  which is the default SCHEDD_INTERVAL.

Expected results: Schedd should publish the job count after
  a remove event, just as it does after other events
  (e.g. job addition).

Comment 1 Matthew Farrellee 2011-01-27 16:50:08 UTC
I would expect condor_status -submitter/-schedd to exhibit the same behavior.

Comment 2 Matthew Farrellee 2011-01-31 21:56:08 UTC
This is indeed visible from condor_status -schedd/-submitter as well. The Schedd publishes on SCHEDD_INTERVAL, at the end of a negotiation cycle, at a reconfig or on a reschedule request. Until a publish, the information in the Collector may be stale, as may the information in the QMF object space.

It is probably ok to tickle the Schedd to publish an update on remove, but doing so may have scale implications. The publishing is done as part of a scan of the entire queue. However, the timeout() code has some protections to prevent processing the queue too frequently.
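
To illustrate the kind of protection meant here, the following is a minimal sketch of a rate-limited publish. It is not the Schedd's actual code: the class name PublishThrottle, the tickle() method and the pending flag are hypothetical, and min_interval only stands in for a minimum-spacing setting such as SCHEDD_MIN_INTERVAL.

```python
import time


class PublishThrottle:
    """Hypothetical model of a rate-limited publish; not the Schedd's real code."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval  # minimum seconds between publishes
        self.clock = clock                # injectable clock, eases testing
        self.last_publish = None          # time of the last publish, if any
        self.pending = False              # a request arrived while throttled

    def tickle(self):
        """Request a publish. Return True if it ran now, False if deferred."""
        now = self.clock()
        if self.last_publish is None or now - self.last_publish >= self.min_interval:
            # Enough time has passed: scan the queue and publish (elided here).
            self.last_publish = now
            self.pending = False
            return True
        # Too soon after the previous publish: remember the request instead.
        self.pending = True
        return False
```

With min_interval = 5, a tickle issued 2 seconds after a publish is deferred, while one issued 5 seconds after it goes through.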

Let's turn this into an RFE for tickling the collector update.

Comment 4 Matthew Farrellee 2011-02-24 12:12:37 UTC
The Schedd also does not send an update when a job completes. This means the number of running jobs may be stale after a job exits.

Comment 5 Matthew Farrellee 2011-02-24 12:35:38 UTC
Additionally, the Schedd does not send an update when a job starts running.

Comment 6 Matthew Farrellee 2011-02-24 12:38:07 UTC
Also, the Schedd does not send an update when holding a job.

Comment 7 Matthew Farrellee 2011-02-24 12:44:13 UTC
There are many paths to a job changing state that do not result in an update to the Collector. Another not listed above is periodic expression evaluation.

Even though timeout() protects itself from rapid repeated calls, given an active Schedd, the calls will effectively make SCHEDD_INTERVAL = SCHEDD_MIN_INTERVAL. Instead of tickling timeout() for each such transition, I suggest setting SCHEDD_INTERVAL to a lower value, one that provides an acceptable lag for a deployment.
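
A toy model makes the point concrete. This is only a sketch under stated assumptions: count_publishes() and its parameters are hypothetical, time advances in one-second ticks, a publish is assumed to have happened at time 0, and the value 5 for the minimum interval is chosen for illustration.

```python
def count_publishes(event_times, interval, min_interval, horizon, tickle_on_event):
    """Count publishes in (0, horizon] for a toy Schedd model (hypothetical)."""
    events = set(event_times)
    publishes = 0
    last = 0          # assume a publish happened at time 0 (not counted)
    pending = False   # a job transition is waiting to be published
    for t in range(1, horizon + 1):
        if tickle_on_event and t in events:
            pending = True
        elapsed = t - last
        # Publish on the regular interval, or early if a tickle is pending
        # and the minimum spacing between publishes has been respected.
        if elapsed >= interval or (pending and elapsed >= min_interval):
            publishes += 1
            last = t
            pending = False
    return publishes


# An active Schedd: one job state transition every second for 300 seconds.
busy = range(1, 301)
no_tickle = count_publishes(busy, 300, 5, 300, tickle_on_event=False)
with_tickle = count_publishes(busy, 300, 5, 300, tickle_on_event=True)
short_interval = count_publishes(busy, 5, 5, 300, tickle_on_event=False)
```

In this model no_tickle is 1, while with_tickle and short_interval are both 60: on a busy Schedd, tickling on every transition publishes exactly as often as setting the interval to the minimum would.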

Wild speculation: SCHEDD_INTERVAL for small or medium sized deployments could be easily set to 30 (from 300). For large deployments, a shorter publish interval may impact Schedd throughput.