Bug 673179 - RFE: Make Schedd send updates on job remove
Summary: RFE: Make Schedd send updates on job remove
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.0
Assignee: Matthew Farrellee
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On: 634302
Blocks:
Reported: 2011-01-27 16:44 UTC by Jan Sarenik
Modified: 2011-02-24 12:44 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-24 12:44:13 UTC
Target Upstream Version:
Embargoed:



Description Jan Sarenik 2011-01-27 16:44:14 UTC
See bug 595704 comment #25 and onwards.

condor-qmf-7.4.5-0.7.el5

How reproducible: 100%

Steps to Reproduce:
0. Start with a clean Condor and a clean pool.
   QMF_UPDATE_INTERVAL = 5
   COLLECTOR_UPDATE_INTERVAL = 5
   (cumin and qpidd also set to update every 5 seconds)

1. Submit a simple job, e.g.
        condor_submit << EOF
        Executable     = /bin/sleep
        Universe = vanilla
        args    = 20m
        queue 1
        EOF
2. condor_rm -all
3. Go to Cumin -> Grid -> Overview and look at statistics
  
Actual results: Idle or Running stays at 1 for 5 minutes,
  which is the default SCHEDD_INTERVAL.

Expected results: Schedd should publish the job count after
  a remove event, as it does for other events
  (e.g. job addition).

Comment 1 Matthew Farrellee 2011-01-27 16:50:08 UTC
I would expect condor_status -submitter/-schedd to exhibit the same behavior.

Comment 2 Matthew Farrellee 2011-01-31 21:56:08 UTC
This is indeed visible from condor_status -schedd/-submitter as well. The Schedd publishes on SCHEDD_INTERVAL, at the end of a negotiation cycle, at a reconfig, or on a reschedule request. Until a publish, the information in the Collector may be stale, as may the information in the QMF object space.

It is probably ok to tickle the Schedd to publish an update on remove, but doing so may have scale implications. The publishing is done as part of a scan of the entire queue. However, the timeout() code has some protections to prevent processing the queue too frequently.

Let's turn this into an RFE for tickling the collector update.

Comment 4 Matthew Farrellee 2011-02-24 12:12:37 UTC
The Schedd also does not send an update when a job completes. This means the number of running jobs may be stale after a job exits.

Comment 5 Matthew Farrellee 2011-02-24 12:35:38 UTC
Additionally, the Schedd does not send an update when a job starts running.

Comment 6 Matthew Farrellee 2011-02-24 12:38:07 UTC
Also, the Schedd does not send an update when holding a job.

Comment 7 Matthew Farrellee 2011-02-24 12:44:13 UTC
There are many paths to a job changing state that do not result in an update to the Collector. Another not listed above is periodic expression evaluation.

Even though timeout() protects itself from rapid repeated calls, given an active Schedd, the calls will effectively make SCHEDD_INTERVAL = SCHEDD_MIN_INTERVAL. Instead of tickling timeout() for each such transition, I suggest setting SCHEDD_INTERVAL to a lower value, one that provides an acceptable lag for a deployment.

Wild speculation: SCHEDD_INTERVAL for small or medium sized deployments could be easily set to 30 (from 300). For large deployments, a shorter publish interval may impact Schedd throughput.
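
The reasoning in comment 7 can be illustrated with a small model. This is only a sketch under stated assumptions, not the Schedd's actual code: publish_times() is a hypothetical helper, time advances in whole seconds, a publish is assumed at t = 0, and the values 300, 30 and 5 are illustrative stand-ins for SCHEDD_INTERVAL and SCHEDD_MIN_INTERVAL.

```python
def publish_times(transitions, horizon, interval, min_interval, tickle):
    """Return the seconds at which a simulated Schedd would publish.

    transitions  -- times (s) at which some job changes state
    horizon      -- last second to simulate
    interval     -- periodic publish period (stand-in for SCHEDD_INTERVAL)
    min_interval -- minimum spacing between publishes
                    (stand-in for SCHEDD_MIN_INTERVAL)
    tickle       -- if True, every transition requests an early publish
    """
    transitions = set(transitions)
    published = []
    last = 0           # assumption: a publish happened at t = 0
    requested = False  # a tickle waiting for the rate limit to expire
    for t in range(1, horizon + 1):
        if tickle and t in transitions:
            requested = True
        due = t - last >= interval          # regular periodic publish
        allowed = t - last >= min_interval  # rate-limit protection
        if due or (requested and allowed):
            published.append(t)
            last = t
            requested = False
    return published


if __name__ == "__main__":
    busy = range(1, 301)  # an active Schedd: one transition every second
    print(len(publish_times(busy, 300, 300, 5, tickle=False)))
    print(len(publish_times(busy, 300, 300, 5, tickle=True)))
    print(len(publish_times(busy, 300, 30, 5, tickle=False)))
```

In this model, tickling on every transition of a busy Schedd yields a publish every min_interval seconds regardless of interval, while simply lowering interval gives a predictable publish rate that does not depend on how active the queue is.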

