Bug 673179

Summary:	RFE: Make Schedd send updates on job remove
Product:	Red Hat Enterprise MRG	Reporter:	Jan Sarenik <jsarenik>
Component:	condor	Assignee:	Matthew Farrellee <matt>
Status:	CLOSED NOTABUG	QA Contact:	MRG Quality Engineering <mrgqe-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	1.3	CC:	eerlands, iboverma, ltoscano, matt, tmckay
Target Milestone:	2.0	Keywords:	FutureFeature
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-02-24 12:44:13 UTC	Type:	---
Bug Depends On:	634302
Bug Blocks:

Description Jan Sarenik 2011-01-27 16:44:14 UTC

See bug 595704 comment #25 and onwards.

condor-qmf-7.4.5-0.7.el5

How reproducible: 100%

Steps to Reproduce:
0. Start with a clean Condor installation and an empty pool.
   QMF_UPDATE_INTERVAL = 5
   COLLECTOR_UPDATE_INTERVAL = 5
   (cumin and qpidd also set to update every 5 seconds)

1. Submit a simple job, e.g.
        condor_submit << EOF
        Executable     = /bin/sleep
        Universe = vanilla
        args    = 20m
        queue 1
        EOF
2. condor_rm -all
3. Go to Cumin -> Grid -> Overview and look at statistics
  
Actual results: Idle or Running stays at 1 for 5 minutes,
  which is the default SCHEDD_INTERVAL.

Expected results: Schedd should publish the job count after
  a remove event, as it does for other events
  (e.g. job addition).

Comment 1 Matthew Farrellee 2011-01-27 16:50:08 UTC

I would expect condor_status -submitter/-schedd to exhibit the same behavior.

Comment 2 Matthew Farrellee 2011-01-31 21:56:08 UTC

This is indeed visible from condor_status -schedd/-submitter as well. The Schedd publishes on SCHEDD_INTERVAL, at the end of a negotiation cycle, at a reconfig or on a reschedule request. Until a publish, the information in the Collector may be stale, as well as the information in the QMF object space.
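
Since a reschedule request is one of the publish triggers listed above, a possible stopgap is to follow the remove with condor_reschedule. This is a minimal sketch: the helper name rm_and_refresh is hypothetical, and whether the refreshed counts reach Cumin promptly also depends on the other update intervals in the setup.

```shell
# Hypothetical helper; rm_and_refresh is not part of Condor.
# Removes all jobs, then sends a reschedule request to the Schedd,
# which (per the list of publish triggers above) should cause a publish.
rm_and_refresh() {
    condor_rm -all && condor_reschedule
}
```

After calling rm_and_refresh, condor_status -schedd can be used to check whether the Collector already shows the updated job counts.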

It is probably ok to tickle the Schedd to publish an update on remove, but it may have scale implications. The publishing is done as part of a scan of the entire queue. However, the timeout() code has some protections to prevent processing the queue too frequently.

Let's turn this into an RFE for tickling the collector update.

Comment 4 Matthew Farrellee 2011-02-24 12:12:37 UTC

The Schedd also does not send an update when a job completes. This means the number of running jobs may be stale after a job exits.

Comment 5 Matthew Farrellee 2011-02-24 12:35:38 UTC

Additionally, the Schedd does not send an update when a job starts running.

Comment 6 Matthew Farrellee 2011-02-24 12:38:07 UTC

Also, the Schedd does not send an update when holding a job.

Comment 7 Matthew Farrellee 2011-02-24 12:44:13 UTC

There are many paths to a job changing state that do not result in an update to the Collector. Another not listed above is periodic expression evaluation.

Even though timeout() protects itself from rapid repeated calls, given an active Schedd, the calls will effectively make SCHEDD_INTERVAL = SCHEDD_MIN_INTERVAL. Instead of tickling timeout() for each such transition, I suggest setting SCHEDD_INTERVAL to a lower value, one that provides an acceptable lag for a deployment.

Wild speculation: SCHEDD_INTERVAL for small or medium sized deployments could easily be set to 30 seconds (down from the default of 300). For large deployments, a shorter publish interval may impact Schedd throughput.
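
As a configuration fragment, the suggestion above might look like the following. This is a sketch only: it assumes the line goes into the Schedd host's local Condor configuration, and 30 is the speculative value from this comment, not a tested recommendation.

```
# Publish Schedd updates every 30 seconds instead of the default 300.
# Assumed value for a small or medium deployment; tune per pool size.
SCHEDD_INTERVAL = 30
```

Running condor_reconfig afterwards should make the Schedd re-read its configuration; a reconfig is also one of the events on which the Schedd publishes.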