Bug 753829 - Dag submissions have incorrect job totals from plugin publisher
Summary: Dag submissions have incorrect job totals from plugin publisher
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: 2.1
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 2.1.1
Target Release: ---
Assignee: Pete MacKinnon
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 765607
 
Reported: 2011-11-14 16:11 UTC by Pete MacKinnon
Modified: 2012-03-28 09:43 UTC (History)

Fixed In Version: condor-7.6.5-0.10
Doc Type: Bug Fix
Doc Text:
Cause: Monitoring a DAG-based submission's job totals when the schedd QMF plug-in is used for job publishing. Consequence: The job totals are incorrect and do not properly accumulate as the DAG submission progresses through its node job execution. Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts appeared incorrect. Result: DAG submission job state totals increase, decrease and accumulate consistently as viewed by a QMF client.
Clone Of:
Environment:
Last Closed: 2012-02-06 18:17:59 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHSA-2012:0100
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: MRG Grid security, bug fix, and enhancement update
Last Updated: 2012-02-06 23:15:47 UTC

Description Pete MacKinnon 2011-11-14 16:11:10 UTC
SCHEDD.PLUGINS = $(LIBEXEC)/MgmtScheddPlugin-plugin.so
QMF_PUBLISH_SUBMISSIONS = True

The above config tells the schedd plugin to publish job submission objects for QMF. When a DAG job is submitted, the job counts (idle, running, etc.) are correct at any given point in time but do not persist. This is because the internal C++ submission object in the plugin is recreated on each DAG node job submit, which wipes out the overall job totals.

The same test using the condor_job_server job publisher shows correct totals (when the schedd update interval is accounted for).
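
As a rough illustration of the symptom (not the plugin's actual code), the sketch below shows how totals that live in a per-submission object are lost if that object is erased and rebuilt between node jobs. SubmissionTotals, completedAfterDag and the "dag-submission" name are all hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-submission totals; the real plugin class is not shown in this report.
struct SubmissionTotals {
    int running = 0;
    int completed = 0;
};

// Runs `nodeJobs` DAG node jobs one after another against a registry of
// submissions keyed by name and returns the final completed total.
// If `recreatePerNode` is true, the entry is erased and rebuilt before each
// node job, imitating a submission object that is recreated on every submit.
int completedAfterDag(int nodeJobs, bool recreatePerNode) {
    std::map<std::string, SubmissionTotals> registry;
    const std::string name = "dag-submission";  // assumed submission name
    for (int i = 0; i < nodeJobs; ++i) {
        if (recreatePerNode) {
            registry.erase(name);  // totals accumulated so far are lost here
        }
        SubmissionTotals& s = registry[name];  // creates a zeroed entry if absent
        ++s.running;    // node job starts
        --s.running;    // node job finishes
        ++s.completed;  // and is counted as completed
    }
    return registry[name].completed;
}
```

With five node jobs, the recreating variant ends with a completed total of 1, while the variant that keeps the object alive ends with 5. The count is right at each instant but never accumulates.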

Comment 1 Pete MacKinnon 2011-12-05 13:26:43 UTC
The comparator for the std::set that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why the job counts were off.

UW commit a80cf51
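
For readers unfamiliar with how an insufficient std::set comparator can lose elements, here is a minimal sketch. It does not reproduce the change in the commit above. JobId, ProcOnlyLess, ClusterProcLess and countTrackedNodeJobs are hypothetical names, and the sketch assumes each DAG node job lands in its own cluster with proc 0.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical job identifier (cluster.proc); not the plugin's real type.
struct JobId {
    int cluster;
    int proc;
};

// Insufficient ordering: looks at proc only, so 2.0 and 3.0 compare as
// equivalent and std::set keeps just one of them.
struct ProcOnlyLess {
    bool operator()(const JobId& a, const JobId& b) const {
        return a.proc < b.proc;
    }
};

// Sufficient ordering: a strict weak ordering over (cluster, proc).
struct ClusterProcLess {
    bool operator()(const JobId& a, const JobId& b) const {
        if (a.cluster != b.cluster) return a.cluster < b.cluster;
        return a.proc < b.proc;
    }
};

// Inserts five node jobs (clusters 2..6, proc 0) into a set ordered by
// Compare and returns how many distinct jobs the set ends up tracking.
template <typename Compare>
std::size_t countTrackedNodeJobs() {
    std::set<JobId, Compare> active;
    for (int cluster = 2; cluster <= 6; ++cluster) {
        active.insert(JobId{cluster, 0});
    }
    return active.size();
}
```

Under these assumptions countTrackedNodeJobs<ProcOnlyLess>() tracks a single job while countTrackedNodeJobs<ClusterProcLess>() tracks all five. A set that under-counts active jobs can look empty too early, which is consistent with a submission being torn down before the DAG has finished.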

Comment 6 Pete MacKinnon 2011-12-12 17:43:35 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Monitoring a DAG-based submission's job totals when the schedd QMF plug-in is used for job publishing.

Consequence: The job totals are incorrect and do not properly accumulate as the DAG submission progresses through it's node job execution.

Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts were appeared incorrect.

Result: DAG submission ob state totals increase, decrease and accumulate consistently as viewed by a QMF client.

Comment 7 Pete MacKinnon 2011-12-12 17:44:37 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -4,4 +4,4 @@
 
 Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts were appeared incorrect.
 
-Result: DAG submission ob state totals increase, decrease and accumulate consistently as viewed by a QMF client.
+Result: DAG submission job state totals increase, decrease and accumulate consistently as viewed by a QMF client.

Comment 8 Pete MacKinnon 2011-12-13 14:09:34 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -2,6 +2,6 @@
 
 Consequence: The job totals are incorrect and do not properly accumulate as the DAG submission progresses through it's node job execution.
 
-Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts were appeared incorrect.
+Fix: A comparator for an internal collection that tracks active jobs in a submission was insufficient for the DAG case. Thus, DAG submissions were being prematurely destroyed and recreated. This is why job counts appeared incorrect.
 
 Result: DAG submission job state totals increase, decrease and accumulate consistently as viewed by a QMF client.

Comment 10 Lubos Trilety 2012-01-06 15:07:43 UTC
Successfully reproduced on:
$CondorVersion: 7.6.3 Jul 27 2011 BuildID: RH-7.6.3-0.3.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

The number of submissions in QMF doesn't correspond to the condor_q statistics.

Comment 11 Lubos Trilety 2012-01-06 15:25:53 UTC
Tested on:
$CondorVersion: 7.6.5 Dec 16 2011 BuildID: RH-7.6.5-0.11.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.5 Dec 16 2011 BuildID: RH-7.6.5-0.11.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.5 Dec 16 2011 BuildID: RH-7.6.5-0.11.el6 $
$CondorPlatform: I686-RedHat_6.2 $

$CondorVersion: 7.6.5 Dec 16 2011 BuildID: RH-7.6.5-0.11.el6 $
$CondorPlatform: X86_64-RedHat_6.2 $

The number of submissions corresponds better with the condor_q statistics, and there are 5 completed jobs after the dagman job ends.

>>> VERIFIED

Comment 12 errata-xmlrpc 2012-02-06 18:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html

