The src/management/qmfprobe.py script, which queries all plugin objects, does not report the expected job counts from the scheduler and submitter objects
There seem to be two problems:
a) idle counts: we can get an initial value of -1 from the schedd-set attr after a py submit
b) the submitter thinks there is still 1 job running after all have completed
Need to test this as a restart followed by condor_submit instead of py submit.
These counts are actually coming from the UPDATE_SCHEDD_ADS and UPDATE_SUBMITTOR_ADS. QMF plugin just directly updates whatever it gets from the schedd. The counts are off by 1 at both ends...

~/personal-condor/log $ grep -e "IdleJobs" -e "RunningJobs" SchedLog
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 3
TotalRunningJobs = 0
06/07 17:42:04 Changed attribute: RunningJobs = 0
06/07 17:42:04 Changed attribute: IdleJobs = 3
RunningJobs = 0
IdleJobs = 3
TotalIdleJobs = 1
TotalRunningJobs = 2
06/07 17:47:05 Changed attribute: RunningJobs = 2
06/07 17:47:05 Changed attribute: IdleJobs = 1
RunningJobs = 2
IdleJobs = 1
TotalIdleJobs = 0
TotalRunningJobs = 1
06/07 17:47:25 Changed attribute: RunningJobs = 1
06/07 17:47:25 Changed attribute: IdleJobs = 0
RunningJobs = 1
IdleJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 1
06/07 17:52:25 Changed attribute: RunningJobs = 1
06/07 17:52:25 Changed attribute: IdleJobs = 0
RunningJobs = 1
IdleJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0

When we are really 2R/1I the update doesn't change from 3I. Then for a period of time we are 3C and it still thinks 1R. Matt, thoughts?
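For anyone wanting to line up the update timing without reading the raw grep output, here is a rough Python sketch that groups the "Changed attribute" lines by timestamp. The helper name summarize_updates and the regular expression are my own and assume the log line format shown in this ticket; this is not part of qmfprobe.py.

```python
import re

# Assumed format, taken from the SchedLog excerpt in this ticket:
#   "06/07 17:42:04 Changed attribute: RunningJobs = 0"
CHANGED = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) Changed attribute: "
    r"(RunningJobs|IdleJobs) = (-?\d+)$"
)


def summarize_updates(lines):
    """Group 'Changed attribute' lines by timestamp (hypothetical helper).

    Returns a list of (timestamp, {attribute: value}) pairs in log order.
    """
    updates = {}  # dicts preserve insertion order
    for line in lines:
        match = CHANGED.match(line.strip())
        if match:
            stamp, attr, value = match.groups()
            updates.setdefault(stamp, {})[attr] = int(value)
    return list(updates.items())


# A few lines copied from the excerpt above; the non-matching line is skipped.
sample = """\
06/07 17:42:04 Changed attribute: RunningJobs = 0
06/07 17:42:04 Changed attribute: IdleJobs = 3
TotalIdleJobs = 1
06/07 17:47:05 Changed attribute: RunningJobs = 2
06/07 17:47:05 Changed attribute: IdleJobs = 1
""".splitlines()

for stamp, counts in summarize_updates(sample):
    print(stamp, counts)
```

Feeding it the whole SchedLog would show one entry per update, which makes the five-minute gap between 17:42:04 and 17:47:05 easy to spot.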
Thought - You didn't wait long enough for an update that showed 0R,0I. Does condor_status -sched already report the 1R after all are complete (shown via condor_q | tail -n1)? The SCHEDD&SUBMITTER updates may be delayed when there are no jobs to report, which may be the wrong semantics; e.g., a better rule might be to skip updates only when nothing has changed.
Lowering the SCHEDD_INTERVAL from the 5 min default certainly improved this. However, we never see a final updated submitter ad (i.e., 0 jobs running). The last one claims there is 1 job running and that is what we are left with.
If that can be verified by looking at condor_status -submitter then it's a candidate for fixing. IIRC, submitter ads are generated from jobs in the queue. If there are no jobs for a submitter (all completed) I could imagine the Schedd just wouldn't know to send a final update (an invalidate!).
Fixed the incorrect idle job stats on the scheduler and submitter (needed to augment the inbound classad a bit). Now we need a solution for the missing UPDATE_SUBMITTER_AD to update the submitter objects. Also I see this:

~/personal-condor/log $ condor_status -submitter

Name       Machine      Running  IdleJobs  HeldJobs

nobody     localhost.         0         0  [???????]

           RunningJobs  IdleJobs  HeldJobs

nobody               0         0         0

Total                0         0         0

(Omitted 1 malformed ads in computed attribute totals)
We were missing a plugin update when we walk the owner list and there are no jobs. The collectors were getting this submitter update already - just needed to do the same for the schedd plugins also. FH 29c3f20c2ea
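For readers who don't want to dig through the commit, the shape of the problem can be sketched in Python roughly as below. This is only a toy model: Owner, Sink, walk_owners and the notify_plugins_when_empty flag are hypothetical names of mine, not the actual schedd code, and the real owner walk is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Owner:
    """Hypothetical per-submitter job counts."""
    name: str
    running: int = 0
    idle: int = 0


@dataclass
class Sink:
    """Stand-in for a collector or a schedd plugin; records the ads it gets."""
    received: list = field(default_factory=list)

    def update(self, ad):
        self.received.append(ad)


def walk_owners(owners, collectors, plugins, notify_plugins_when_empty=True):
    """Send one submitter ad per owner (toy model of the owner-list walk)."""
    for owner in owners:
        ad = {"Name": owner.name,
              "RunningJobs": owner.running,
              "IdleJobs": owner.idle}
        # Collectors were already getting the update, even with no jobs.
        for collector in collectors:
            collector.update(ad)
        # Plugins were skipped for owners with no jobs until the fix.
        has_jobs = owner.running + owner.idle > 0
        if has_jobs or notify_plugins_when_empty:
            for plugin in plugins:
                plugin.update(ad)


owners = [Owner("nobody")]  # all jobs completed: 0 running, 0 idle

old_collector, old_plugin = Sink(), Sink()
walk_owners(owners, [old_collector], [old_plugin],
            notify_plugins_when_empty=False)  # behaviour before the fix

new_collector, new_plugin = Sink(), Sink()
walk_owners(owners, [new_collector], [new_plugin])  # behaviour after the fix

print(len(old_plugin.received), len(new_plugin.received))
```

In the "before" case the plugin never hears about the empty owner, so whatever it saw last (1 job running) sticks; in the "after" case it gets the same zero-count ad as the collector.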