Bug 589660
Summary: | QMF: Job status stats incorrect on scheduler and submitter objects | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Pete MacKinnon <pmackinn> |
Component: | condor | Assignee: | Pete MacKinnon <pmackinn> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | Development | CC: | iboverma, matt |
Target Milestone: | 1.3 | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2010-07-22 17:18:47 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
Description
Pete MacKinnon
2010-05-06 15:57:49 UTC
Seems to be a problem with:

a) idle counts - we can get -1 to start from the schedd-set attr after a py submit
b) submitter thinks there is still 1 job running after all have completed

Need to test this with a restart followed by condor_submit instead of a py submit.

These counts are actually coming from the UPDATE_SCHEDD_ADS and UPDATE_SUBMITTOR_ADS. The QMF plugin just directly updates whatever it gets from the schedd. The counts are off by 1 at both ends...

~/personal-condor/log $ grep -e "IdleJobs" -e "RunningJobs" SchedLog
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 3
TotalRunningJobs = 0
06/07 17:42:04 Changed attribute: RunningJobs = 0
06/07 17:42:04 Changed attribute: IdleJobs = 3
RunningJobs = 0
IdleJobs = 3
TotalIdleJobs = 1
TotalRunningJobs = 2
06/07 17:47:05 Changed attribute: RunningJobs = 2
06/07 17:47:05 Changed attribute: IdleJobs = 1
RunningJobs = 2
IdleJobs = 1
TotalIdleJobs = 0
TotalRunningJobs = 1
06/07 17:47:25 Changed attribute: RunningJobs = 1
06/07 17:47:25 Changed attribute: IdleJobs = 0
RunningJobs = 1
IdleJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 1
06/07 17:52:25 Changed attribute: RunningJobs = 1
06/07 17:52:25 Changed attribute: IdleJobs = 0
RunningJobs = 1
IdleJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalIdleJobs = 0
TotalRunningJobs = 0

When we are really 2R/1I the update doesn't change from 3I. Then for a period of time we are 3C and it still thinks 1R. Matt, thoughts?

Thought - You didn't wait long enough for an update that showed 0R,0I. Does condor_status -sched already report the 1R after all are complete (shown via condor_q | tail -n1)?

The SCHEDD & SUBMITTER updates may be delayed when there are no jobs to report, which may be the wrong semantic, e.g. don't report on no change instead.

Lowering the SCHEDD_INTERVAL from the 5 min default certainly improved this. However, we never see a final updated submitter ad (i.e., 0 jobs running). The last one claims there is 1 job running and that is what we are left with.

If that can be verified by looking at condor_status -submitter then it's a candidate for fixing. IIRC, submitter ads are generated from jobs in the queue. If there are no jobs for a submitter (all completed) I could imagine the Schedd just wouldn't know to send a final update (an invalidate!).

Fixed for incorrect idle job stats on the scheduler and submitter (needed to augment the inbound classad a bit). Now we need a solution for the missing UPDATE_SUBMITTER_AD to update the submitter objects.

Also I see this:

~/personal-condor/log $ condor_status -submitter
Name                 Machine      Running IdleJobs HeldJobs
nobody               localhost.         0        0 [???????]

                     RunningJobs IdleJobs HeldJobs
nobody                         0        0        0
Total                          0        0        0

(Omitted 1 malformed ads in computed attribute totals)

We were missing a plugin update when we walk the owner list and there are no jobs. The collectors were getting this submitter update already - just needed to do the same for the schedd plugins also.

FH 29c3f20c2ea
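
The fix described in the last comment can be illustrated with a small Python sketch. This is not the actual schedd code (which is C++ and not shown in this report); `Owner`, `RecordingSink`, `make_submitter_ad`, `publish_submitter_ads` and the `notify_plugins_when_empty` flag are all hypothetical names invented for illustration. The only thing taken from the report is the behaviour: collectors already received a submitter update for an owner with no jobs, while the plugins did not, so a plugin-backed object could stay stuck at "1 running".

```python
from dataclasses import dataclass


@dataclass
class Owner:
    """Hypothetical per-submitter job counters kept by the scheduler."""
    name: str
    running: int = 0
    idle: int = 0
    held: int = 0

    def has_jobs(self) -> bool:
        return (self.running + self.idle + self.held) > 0


class RecordingSink:
    """Stand-in for a collector or a plugin: it just remembers each ad."""

    def __init__(self):
        self.updates = []

    def update(self, ad):
        self.updates.append(dict(ad))


def make_submitter_ad(owner):
    # Attribute names follow the ones visible in the log output above.
    return {
        "Name": owner.name,
        "RunningJobs": owner.running,
        "IdleJobs": owner.idle,
        "HeldJobs": owner.held,
    }


def publish_submitter_ads(owners, collector, plugins,
                          notify_plugins_when_empty=True):
    """Walk the owner list and publish one submitter ad per owner.

    With notify_plugins_when_empty=False this mimics the reported bug:
    the collector still gets the zero-count ad, but the plugins are
    skipped, so they keep the last non-zero counts they saw.
    """
    for owner in owners:
        ad = make_submitter_ad(owner)
        collector.update(ad)
        if owner.has_jobs() or notify_plugins_when_empty:
            for plugin in plugins:
                plugin.update(ad)


if __name__ == "__main__":
    owner = Owner("nobody", running=1)
    collector, plugin = RecordingSink(), RecordingSink()
    publish_submitter_ads([owner], collector, [plugin])
    owner.running = 0  # the last job completes
    publish_submitter_ads([owner], collector, [plugin])
    print(plugin.updates[-1])
```

Passing `notify_plugins_when_empty=False` reproduces the stale state from the report in this toy model: the collector's last ad shows zero running jobs while the plugin's last ad still shows one.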