Bug 753829

Summary: Dag submissions have incorrect job totals from plugin publisher
Product: Red Hat Enterprise MRG
Reporter: Pete MacKinnon <pmackinn>
Component: condor-qmf
Assignee: Pete MacKinnon <pmackinn>
Status: CLOSED ERRATA
QA Contact: Lubos Trilety <ltrilety>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.1
CC: ltoscano, ltrilety, matt, mkudlej, tstclair
Target Milestone: 2.1.1   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: condor-7.6.5-0.10
Doc Type: Bug Fix
Doc Text:
Cause: Monitoring a DAG-based submission's job totals when the schedd QMF plug-in is used for job publishing. Consequence: The job totals are incorrect and do not properly accumulate as the DAG submission progresses through its node job execution. Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts appeared incorrect. Result: DAG submission job state totals increase, decrease and accumulate consistently as viewed by a QMF client.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-02-06 18:17:59 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 765607    

Description Pete MacKinnon 2011-11-14 16:11:10 UTC
SCHEDD.PLUGINS = $(LIBEXEC)/MgmtScheddPlugin-plugin.so
QMF_PUBLISH_SUBMISSIONS = True

The above configuration tells the schedd plug-in to publish job submission objects to QMF. When a DAG job is submitted, the job counts (idle, running, etc.) are correct at any given point in time but do not persist. This is because the internal C++ submission object in the plug-in is recreated on each DAG node job submission, which wipes out the overall job totals.

The same test using the condor_job_server job publisher shows correct totals (when the schedd update interval is accounted for).

Comment 1 Pete MacKinnon 2011-12-05 13:26:43 UTC
The comparator for the std::set that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why the job counts were off.

UW commit a80cf51
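
To illustrate the failure mode described in this comment, here is a minimal sketch. It is not taken from the commit above. JobId, ProcOnlyLess, ClusterProcLess and activeAfterOneNodeFinishes are hypothetical names, and the "proc only" ordering is only an assumption about what an insufficient comparator might look like. The sketch relies on std::set treating two elements as the same key whenever neither orders before the other, so a comparator that ignores a distinguishing field silently collapses distinct jobs into one entry.

```cpp
#include <cstddef>
#include <set>

// Hypothetical job identifier: Condor job ids have the form cluster.proc.
struct JobId {
    int cluster;
    int proc;
};

// Assumed "insufficient" ordering: it looks at proc only. That can tell
// jobs apart inside a single cluster, but DAG node jobs typically land
// in different clusters, each with proc 0.
struct ProcOnlyLess {
    bool operator()(const JobId& a, const JobId& b) const {
        return a.proc < b.proc;
    }
};

// Ordering that distinguishes every job: cluster first, then proc.
struct ClusterProcLess {
    bool operator()(const JobId& a, const JobId& b) const {
        if (a.cluster != b.cluster) return a.cluster < b.cluster;
        return a.proc < b.proc;
    }
};

// Track three DAG node jobs (101.0, 102.0, 103.0), then let one finish.
// Returns how many jobs the set still considers active.
template <typename Compare>
std::size_t activeAfterOneNodeFinishes() {
    std::set<JobId, Compare> active;
    active.insert(JobId{101, 0});
    active.insert(JobId{102, 0});  // dropped by ProcOnlyLess: "equal" to 101.0
    active.insert(JobId{103, 0});  // dropped by ProcOnlyLess as well
    active.erase(JobId{101, 0});   // first node job completes
    return active.size();
}
```

With ProcOnlyLess the function returns 0: the set looks empty after the first node job completes, which is the kind of condition that would lead a submission to be destroyed early and recreated on the next node submit. With ClusterProcLess it returns 2, so the submission stays alive and its totals can accumulate.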

Comment 6 Pete MacKinnon 2011-12-12 17:43:35 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Monitoring a DAG-based submission's job totals when the schedd QMF plug-in is used for job publishing.

Consequence: The job totals are incorrect and do not properly accumulate as the DAG submission progresses through it's node job execution.

Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts were appeared incorrect.

Result: DAG submission ob state totals increase, decrease and accumulate consistently as viewed by a QMF client.

Comment 7 Pete MacKinnon 2011-12-12 17:44:37 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -4,4 +4,4 @@
 
 Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts were appeared incorrect.
 
-Result: DAG submission ob state totals increase, decrease and accumulate consistently as viewed by a QMF client.
+Result: DAG submission job state totals increase, decrease and accumulate consistently as viewed by a QMF client.

Comment 8 Pete MacKinnon 2011-12-13 14:09:34 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -2,6 +2,6 @@
 
 Consequence: The job totals are incorrect and do not properly accumulate as the DAG submission progresses through it's node job execution.
 
-Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts were appeared incorrect.
+Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts appeared incorrect.
 
 Result: DAG submission job state totals increase, decrease and accumulate consistently as viewed by a QMF client.

Comment 10 Lubos Trilety 2012-01-06 15:07:43 UTC
Successfully reproduced on:
$CondorVersion: 7.6.3 Jul 27 2011 BuildID: RH-7.6.3-0.3.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

The number of submissions in QMF doesn't correspond to condor_q statistics.

Comment 11 Lubos Trilety 2012-01-06 15:25:53 UTC
Tested on:
$CondorVersion: 7.6.5 Dec 16 2011 BuildID: RH-7.6.5-0.11.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.5 Dec 16 2011 BuildID: RH-7.6.5-0.11.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.5 Dec 16 2011 BuildID: RH-7.6.5-0.11.el6 $
$CondorPlatform: I686-RedHat_6.2 $

$CondorVersion: 7.6.5 Dec 16 2011 BuildID: RH-7.6.5-0.11.el6 $
$CondorPlatform: X86_64-RedHat_6.2 $

The number of submissions corresponds better with condor_q statistics, and the run ends with 5 completed jobs after the DAGMan job finishes.

>>> VERIFIED

Comment 12 errata-xmlrpc 2012-02-06 18:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html