Bug 595704 - QMF: submission objects in schedd-plugin scenario should go away after jobs complete/removed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: 1.3
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: 1.3.2
Target Release: ---
Assignee: Pete MacKinnon
QA Contact: Jan Sarenik
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-05-25 12:23 UTC by Pete MacKinnon
Modified: 2011-02-15 12:12 UTC (History)

Fixed In Version: condor-7.4.5-0.7
Doc Type: Bug Fix
Doc Text:
The Schedd QMF plug-in left Submission QMF objects hanging after all of a submission's jobs had finished. This happened because the plug-in did not reference-count jobs against submissions. With this update, jobs are reference-counted as they are attached to submissions, and QMF submission objects are deleted once all their jobs are completed or removed.
Clone Of:
Environment:
Last Closed: 2011-02-15 12:12:47 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0217 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update 2011-02-15 12:10:15 UTC

Description Pete MacKinnon 2010-05-25 12:23:52 UTC
Currently when the schedd plug-in is configured to publish job submissions, submission objects "hang around" beyond the life span of their jobs. However, they are effectively useless since we can't get job summaries from them once those jobs are gone. 

I propose reaping those submission objects after some configurable period of time.

Note that Justin and I have proposed that the schedd-plugin publish scenario is the functional equivalent of condor_q minus quill, etc., i.e., you can query jobs only as long as they are in the queue. If users want persistent job info, they need to use the job server.

Comment 1 Pete MacKinnon 2010-05-25 12:26:31 UTC
Matt, appreciate feedback on proposal.

Comment 2 Matthew Farrellee 2010-05-25 14:54:53 UTC
I support garbage collecting submissions, but on a long timeout. A submission is not a first class entity and is not closed from Condor's perspective.

Comment 3 Pete MacKinnon 2010-11-04 16:04:24 UTC
Trevor, we're proposing to delete the C++ submission mgmt objects when the last job is completed/removed. However, the submission objects should hang around for a while (2-3 days?) in cumin db.

Comment 4 Matthew Farrellee 2010-11-04 19:16:10 UTC
Quick proposal to QMF (jr+mf): allow for separation of ManagementObject memory reclamation and sending of Resource Destroy message

Comment 5 Pete MacKinnon 2010-12-03 21:19:33 UTC
FH sha 89e65dc6 right on V7_4-QMF-branch

manually tested

Comment 6 Pete MacKinnon 2010-12-10 15:05:03 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: no ref counting of jobs to submissions in job server plug-in
Consequences: management console would see submissions but unable to retrieve any job summaries from these 
Fix: added ref counting as jobs are attached to submissions
Result: QMF submission objects are "cleaned up" once all their jobs have been completed or removed
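
A minimal sketch of the reference-counting idea in the note above, for illustration only. The actual plug-in is C++ and its code is not shown in this report; the class and method names below (SubmissionRegistry, attach, release) are hypothetical.

```python
# Illustrative only: hypothetical names, not the plug-in's real API.

class SubmissionRegistry:
    """Keeps one reference count per submission, one reference per live job."""

    def __init__(self):
        self._refs = {}  # submission name -> number of jobs still in the queue

    def attach(self, submission):
        # A job was attached to the submission: take a reference.
        self._refs[submission] = self._refs.get(submission, 0) + 1

    def release(self, submission):
        # A job reached COMPLETED or REMOVED: drop a reference.
        # Returns True if this deleted the submission object.
        if submission not in self._refs:
            return False
        self._refs[submission] -= 1
        if self._refs[submission] <= 0:
            del self._refs[submission]
            return True
        return False

    def submissions(self):
        # Names of the submissions that would still be published.
        return sorted(self._refs)
```

With two jobs attached to the same submission, the first release leaves the object in place and the second one deletes it.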

Comment 8 Jan Sarenik 2011-01-07 15:39:05 UTC
How can I verify this bug?

Comment 9 Pete MacKinnon 2011-01-10 15:13:53 UTC
Submit jobs through QMF or condor_submit to a schedd plug-in job server (QMF_PUBLISH_SUBMISSIONS=True).

Use a devel tool like src/management/qmfprobe.py or submissions.py to verify that once all the procs in a submission have gone to COMPLETED or REMOVED, the QMF submission object is no longer available.

Comment 10 Jan Sarenik 2011-01-11 10:31:15 UTC
Reproduced on condor-7.4.4-0.17.el5

Comment 11 Jan Sarenik 2011-01-12 10:24:24 UTC
Verified on condor-qmf-7.4.5-0.6.el5

Comment 12 Jan Sarenik 2011-01-12 10:25:25 UTC
$ ./submit.py amqp://cumin/cumin@localhost:5672
$ ./qmfprobe.py | grep submission
$ sleep 60
$ ./qmfprobe.py | grep submission
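
The manual probe, sleep, probe sequence above could be turned into a polling loop along these lines. This is only a sketch: list_submissions is a hypothetical callable supplied by the tester (for example, a wrapper that parses qmfprobe.py output), and wait_until_gone is not part of any Condor or QMF tool.

```python
import time

def wait_until_gone(list_submissions, name, timeout=120.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until `name` is no longer among the published submissions.

    list_submissions: hypothetical callable returning the submission names
    currently visible over QMF. Returns True if the submission disappeared
    before the timeout, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if name not in list_submissions():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting.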

Comment 13 Jan Sarenik 2011-01-13 15:21:40 UTC
Oops, now I found there's a problem.
The submission for a job that is running
when I do "condor_rm -all" does not get
removed.

Reproducer:

condor_submit << EOF
Executable     = /bin/sleep
Universe = vanilla
args    = 20m
queue 1
EOF
condor_rm -all
# Now see the Submissions in Cumin->Grid


Tested on cumin-0.1.4462-1.el5

Comment 14 Jan Sarenik 2011-01-13 15:23:37 UTC
Submission objects of these no longer existing
jobs are also visible with src/management/qmfprobe.py

Comment 15 Pete MacKinnon 2011-01-14 21:18:23 UTC
FH sha 8946a558e3 directly in V7_4-QMF-branch

Proc count check needed to account for negative value since 
late ad updates on removed jobs could drive count below zero
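
To illustrate why the check has to tolerate negative values, here is a small sketch. The function name should_reap and the counter are hypothetical; the real check lives in the C++ plug-in and is not shown in this report.

```python
def should_reap(active_procs):
    """Return True when a submission has no procs left in the queue.

    Comparing with <= rather than == matters: if a late ad update for an
    already removed job decrements the counter a second time, the count
    skips past zero and an equality test would never fire.
    """
    return active_procs <= 0


count = 1
count -= 1  # the job is removed
count -= 1  # assumed scenario: a late ad update for the same job arrives
naive_check = (count == 0)          # misses the submission
robust_check = should_reap(count)   # still reaps it
```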

Comment 17 Jan Sarenik 2011-01-17 15:04:38 UTC
I will re-check as soon as there are new packages
containing the change mentioned in comment #15.
Thank you.

Comment 18 Pete MacKinnon 2011-01-17 18:08:06 UTC
More refactoring needed. Can't rely on proc counts from UNEXPANDED state if the schedd is restarted while a job is still running in the job queue.

Comment 19 Pete MacKinnon 2011-01-18 02:05:33 UTC
FH sha f77bdd6

unit testing looks good with restarts of schedd
counts are correct
completion and removal cleans up submission object
NB: completed/removed counts will zero out for a submission if the schedd is restarted and the cluster hasn't completely finished
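
One way to picture the restart behavior described above: if the counts are rebuilt from whatever is still in the job queue, anything that already left the queue cannot be counted again. The sketch below is an assumption about the general approach, not the actual refactoring; rebuild_counts and its input format are hypothetical.

```python
from collections import Counter

def rebuild_counts(queue_jobs):
    """Recompute per-submission status counts after a schedd restart.

    queue_jobs: iterable of (submission_name, status) pairs for the jobs
    still present in the job queue. Jobs that completed or were removed
    before the restart are not in the queue, so their counts start at zero.
    """
    counts = {}
    for submission, status in queue_jobs:
        counts.setdefault(submission, Counter())[status] += 1
    return counts
```

For a cluster with one running and one idle proc left, the rebuilt COMPLETED and REMOVED counts are both zero, regardless of how many procs finished before the restart.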

Comment 21 Jan Sarenik 2011-01-26 08:16:30 UTC
Verified on RHEL5 x86_64 with condor-qmf-7.4.5-0.7.el5,
will check RHEL4 and i386 packages today.

Comment 22 Jan Sarenik 2011-01-26 11:41:26 UTC
Oops, the running jobs count is still wrong in the Grid tab
(where the pools are listed) and in Statistics of Overview.

Comment 23 Jan Sarenik 2011-01-26 11:56:31 UTC
Can that be a Cumin bug? cumin-0.1.4478-1.el5
If yes, then condor-qmf-7.4.5-0.7 seems to be
fine on all architectures including RHEL4.

Comment 24 Jan Sarenik 2011-01-26 12:13:49 UTC
The wrong running jobs count seems to be gone after
some time... I am investigating more now.

Comment 25 Jan Sarenik 2011-01-26 12:23:20 UTC
After rough measurement, it takes 5 minutes before the
wrong statistics are gone.

And I am testing with both QMF_UPDATE_INTERVAL and
COLLECTOR_UPDATE_INTERVAL set to 1 (minute, AFAIK),
qpidd's mgmt-pub-interval set to 5 (seconds)
and cumin's page refresh set to 5 seconds as well.

Comment 26 Trevor McKay 2011-01-26 21:01:16 UTC
(In reply to comment #25)
> After rough measurement, it takes 5 minutes before the
> wrong statistics are gone.
> 
> And I am testing with both QMF_UPDATE_INTERVAL and
> COLLECTOR_UPDATE_INTERVAL set to 1 (minute, AFAIK),
> qpidd's mgmt-pub-interval set to 5 (seconds)
> and cumin's page refresh set to 5 seconds as well.

I reproduced this behavior using the packages above.  It appears that on a job removal, the QMF object for the Collector will not contain an updated job count for 5 minutes.  Cumin sees update messages on the given interval, and sample data is logged in the database, but the job count does not change if the system is left idle until the 5 minute mark has been passed.

It seems that if a new job is submitted, however, the job count will be corrected shortly thereafter.

This does not seem to be a cumin bug -- the value in the QMF data appears to be incorrect.

Comment 27 Trevor McKay 2011-01-26 21:04:53 UTC
> 
> It seems that if a new job is submitted, however, the job count will be
> corrected shortly thereafter.

Forgot to mention: assuming this is true, then the implication is that on a grid with regular job submission this bug probably would go unnoticed.  However, the root cause should be understood.

Comment 28 Matthew Farrellee 2011-01-27 01:37:12 UTC
It's not a bug, it's the interval the Schedd uses to publish information about itself, which is then reflected into the QMF object space.

SCHEDD_INTERVAL

http://www.cs.wisc.edu/condor/manual/v7.5/3_3Configuration.html#17874

Comment 29 Trevor McKay 2011-01-27 13:39:25 UTC
(In reply to comment #28)
> It's not a bug, it's the interval the Schedd uses to publish information about
> itself, which is then reflected into the QMF object space.
> 
> SCHEDD_INTERVAL
> 
> http://www.cs.wisc.edu/condor/manual/v7.5/3_3Configuration.html#17874

Thanks, that explains the 5 minutes.  But is it by design that a job submission causes the schedd to publish, but a job removal does not?  This is what I'm observing.  Seems inconsistent to me.

Comment 30 Matthew Farrellee 2011-01-27 13:55:39 UTC
It's possible. There are a number of things that may prevent the Schedd from sending an update to the collector (and QMF object). That should be a separate BZ.

Comment 31 Jan Sarenik 2011-01-27 21:33:55 UTC
Verified on
condor-qmf-7.4.5-0.7.el5
condor-qmf-7.4.5-0.7.el4
both i386 and x86_64,

RHEL4s had QMF_BROKER_HOST set to RHEL5
box with running Qpid, checked with qmfprobe.py 

The new bug that was spun off from the above comments is
Bug 673179

Comment 32 Eva Kopalova 2011-02-09 16:38:41 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,4 +1 @@
-Cause: no ref counting of jobs to submissions in job server plug-in
+The Schedd QMF plug-in left Submission QMF objects hanging after all of a submission's jobs had finished. This happened because the plug-in did not reference-count jobs against submissions. With this update, jobs are reference-counted as they are attached to submissions, and QMF submission objects are deleted once all their jobs are completed or removed.
-Consequences: management console would see submissions but unable to retrieve any job summaries from these 
-Fix: added ref counting as jobs are attached to submissions
-Result: QMF submission objects are "cleaned up" once all their jobs have been completed or removed

Comment 33 errata-xmlrpc 2011-02-15 12:12:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html

