Bug 589660

Summary: QMF: Job status stats incorrect on scheduler and submitter objects
Product: Red Hat Enterprise MRG
Reporter: Pete MacKinnon <pmackinn>
Component: condor
Assignee: Pete MacKinnon <pmackinn>
Status: CLOSED CURRENTRELEASE
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Docs Contact:
Priority: medium
Version: Development
CC: iboverma, matt
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-07-22 17:18:47 UTC
Type: ---

Description Pete MacKinnon 2010-05-06 15:57:49 UTC
The src/management/qmfprobe.py script, which queries all plugin objects, does not report the expected job counts from the scheduler and submitter objects.

Comment 1 Pete MacKinnon 2010-05-19 14:28:39 UTC
Seems to be a problem with:
a) idle counts - we can get -1 to start from the schedd-set attr after a py submit
b) submitter thinks there is still 1 job running after all have completed

Need to retest this with a restart followed by condor_submit instead of a py submit.

Comment 2 Pete MacKinnon 2010-06-07 22:11:53 UTC
These counts are actually coming from the UPDATE_SCHEDD_ADS and UPDATE_SUBMITTOR_ADS. QMF plugin just directly updates whatever it gets from the schedd. The counts are off by 1 at both ends...

~/personal-condor/log  $ grep -e "IdleJobs" -e "RunningJobs" SchedLog 
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 3
TotalRunningJobs = 0
06/07 17:42:04 Changed attribute: RunningJobs = 0
06/07 17:42:04 Changed attribute: IdleJobs = 3
RunningJobs = 0
IdleJobs = 3
TotalIdleJobs = 1
TotalRunningJobs = 2
06/07 17:47:05 Changed attribute: RunningJobs = 2
06/07 17:47:05 Changed attribute: IdleJobs = 1
RunningJobs = 2
IdleJobs = 1
TotalIdleJobs = 0
TotalRunningJobs = 1
06/07 17:47:25 Changed attribute: RunningJobs = 1
06/07 17:47:25 Changed attribute: IdleJobs = 0
RunningJobs = 1
IdleJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 1
06/07 17:52:25 Changed attribute: RunningJobs = 1
06/07 17:52:25 Changed attribute: IdleJobs = 0
RunningJobs = 1
IdleJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0

When we are really at 2 running and 1 idle (2R/1I), the update still shows 3 idle (3I). Then, for a period of time, all 3 jobs are completed (3C) and it still reports 1 running (1R).
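
To make the lag easier to see, the "Changed attribute" lines from the grep output can be folded into a per-timestamp timeline. This is only a rough sketch: it assumes the log line format is exactly as pasted above, and the helper name job_count_timeline is made up for illustration (it is not part of Condor or qmfprobe.py).

```python
import re

# Matches schedd log lines of the form shown in the grep output, e.g.
# "06/07 17:47:05 Changed attribute: RunningJobs = 2"
CHANGED = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) Changed attribute: "
    r"(RunningJobs|IdleJobs) = (-?\d+)$"
)


def job_count_timeline(lines):
    """Hypothetical helper: group changed counts by timestamp, in log order."""
    timeline = []
    for line in lines:
        match = CHANGED.match(line.strip())
        if not match:
            continue  # skip the bare "TotalIdleJobs = ..." style lines
        stamp, attr, value = match.group(1), match.group(2), int(match.group(3))
        if timeline and timeline[-1][0] == stamp:
            timeline[-1][1][attr] = value
        else:
            timeline.append((stamp, {attr: value}))
    return timeline


# The timestamped lines copied from the SchedLog excerpt above.
sample = """\
06/07 17:42:04 Changed attribute: RunningJobs = 0
06/07 17:42:04 Changed attribute: IdleJobs = 3
06/07 17:47:05 Changed attribute: RunningJobs = 2
06/07 17:47:05 Changed attribute: IdleJobs = 1
06/07 17:47:25 Changed attribute: RunningJobs = 1
06/07 17:47:25 Changed attribute: IdleJobs = 0
06/07 17:52:25 Changed attribute: RunningJobs = 1
06/07 17:52:25 Changed attribute: IdleJobs = 0
"""

timeline = job_count_timeline(sample.splitlines())
for stamp, counts in timeline:
    print(stamp, counts)
```

The last timestamped entry still carries RunningJobs = 1, which is the stale value the QMF objects end up holding.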

Matt, thoughts?

Comment 3 Matthew Farrellee 2010-06-08 14:58:52 UTC
Thought -
 You didn't wait long enough for an update that showed 0R,0I. Does condor_status -sched already report the 1R after all are complete (shown via condor_q | tail -n1)? The SCHEDD and SUBMITTER updates may be delayed when there are no jobs to report, which may be the wrong semantic; a better one might be to skip the update only when nothing has changed.

Comment 4 Pete MacKinnon 2010-06-08 21:58:08 UTC
Lowering the SCHEDD_INTERVAL from the 5 min default certainly improved this. However, we never see a final updated submitter ad (i.e., 0 jobs running). The last one claims there is 1 job running, and that is what we are left with.

Comment 5 Matthew Farrellee 2010-06-09 10:29:19 UTC
If that can be verified by looking at condor_status -submitter then it's a candidate for fixing. IIRC, submitter ads are generated from jobs in the queue. If there are no jobs for a submitter (all completed) I could imagine the Schedd just wouldn't know to send a final update (an invalidate!).

Comment 6 Pete MacKinnon 2010-06-17 16:22:25 UTC
Fixed the incorrect idle job stats on the scheduler and submitter objects (needed to augment the inbound ClassAd a bit). Now we need a solution for the missing UPDATE_SUBMITTER_AD to update the submitter objects. Also, I see this:

~/personal-condor/log  $ condor_status -submitter

Name                 Machine      Running IdleJobs HeldJobs

nobody@redhat.com    localhost.         0        0 [???????]
                           RunningJobs           IdleJobs           HeldJobs

   nobody@redhat.com                 0                  0                  0

               Total                 0                  0                  0

                    (Omitted 1 malformed ads in computed attribute totals)

Comment 7 Pete MacKinnon 2010-06-18 02:15:51 UTC
We were missing a plugin update when we walk the owner list and there are no jobs. The collectors were getting this submitter update already - just needed to do the same for the schedd plugins also.
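
The effect of the missing update can be modeled in a few lines of Python. This is a toy model under stated assumptions, not the schedd or plugin code: the function submitter_updates, the skip_idle_owners flag, and the plugin_view dictionary are all hypothetical names invented for illustration.

```python
from collections import Counter


def submitter_updates(owners, running_jobs, skip_idle_owners):
    """Hypothetical model of walking the owner list and building updates.

    owners: submitter names the schedd knows about.
    running_jobs: one owner name per job that is currently running.
    skip_idle_owners: True models the bug (no update when an owner has no jobs).
    """
    running = Counter(running_jobs)
    updates = {}
    for owner in owners:
        count = running[owner]  # Counter returns 0 for owners with no jobs
        if count == 0 and skip_idle_owners:
            continue  # nothing is published, so consumers keep the old count
        updates[owner] = {"RunningJobs": count}
    return updates


owner = "nobody@redhat.com"

# What the plugin last heard: one job running.
plugin_view = {owner: {"RunningJobs": 1}}

# All jobs complete, but owners without jobs are skipped: the view stays stale.
plugin_view.update(submitter_updates([owner], [], skip_idle_owners=True))
stale_view = dict(plugin_view)

# With the update also sent for owners without jobs, the view drops to 0.
plugin_view.update(submitter_updates([owner], [], skip_idle_owners=False))

print("before:", stale_view)
print("after: ", plugin_view)
```

In this model the only difference between the two passes is whether an owner with zero jobs still produces an update, which mirrors the description of the fix above.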

FH 29c3f20c2ea