Currently, when the schedd plug-in is configured to publish job submissions, submission objects hang around beyond the life span of their jobs. However, they are effectively useless, since we can't get job summaries from them once those jobs are gone. I propose that we reap those submission objects after some configurable period of time. Note that Justin and I have proposed that the schedd-plugin publish scenario is the functional equivalent of condor_q (minus Quill, etc.), i.e., you get to query jobs as long as they are in the queue. If users want persistent job info, they need to use the job server.
Matt, I'd appreciate feedback on this proposal.
I support garbage-collecting submissions, but on a long timeout. A submission is not a first-class entity and is never closed from Condor's perspective.
Trevor, we're proposing to delete the C++ submission management objects when the last job is completed/removed. However, the submission records should hang around for a while (2-3 days?) in the Cumin DB.
Quick proposal to QMF (jr+mf): allow the reclamation of ManagementObject memory to be separated from the sending of the resource-destroy message.
FH sha 89e65dc6, right on V7_4-QMF-branch; manually tested.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: no ref counting of jobs against submissions in the job server plug-in
Consequence: the management console would see submissions but was unable to retrieve any job summaries from them
Fix: added ref counting as jobs are attached to submissions
Result: QMF submission objects are "cleaned up" once all their jobs have been completed or removed
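For illustration, here is a minimal C++ sketch of the ref-counting approach described in the note above. All class and member names (SubmissionObject, SubmissionRegistry, attachJob, and so on) are hypothetical stand-ins, not the actual plug-in code:

#include <map>
#include <string>

// Hypothetical sketch of per-submission job ref counting in the
// schedd/job-server plug-in; names are illustrative only.
class SubmissionObject {
public:
    explicit SubmissionObject(const std::string &name)
        : m_name(name), m_procs(0) {}

    void attachJob() { ++m_procs; }              // a job joined this submission
    bool detachJob() { return --m_procs <= 0; }  // true once the last job is gone

private:
    std::string m_name;
    int m_procs;  // jobs currently referencing this submission
};

class SubmissionRegistry {
public:
    // Called as each job ad is attached to its submission.
    void onJobAttached(const std::string &submission) {
        SubmissionObject *&obj = m_submissions[submission];
        if (!obj) obj = new SubmissionObject(submission);
        obj->attachJob();
    }

    // Called when a job goes to COMPLETED or REMOVED. Once the last job
    // detaches, the QMF submission object is destroyed as well.
    void onJobFinished(const std::string &submission) {
        std::map<std::string, SubmissionObject *>::iterator it =
            m_submissions.find(submission);
        if (it == m_submissions.end()) return;
        if (it->second->detachJob()) {
            delete it->second;  // resource-destroy message would be sent here
            m_submissions.erase(it);
        }
    }

private:
    std::map<std::string, SubmissionObject *> m_submissions;
};

The key point is that the submission object's lifetime is tied to the count of live jobs, not to a timer.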
How can I verify this bug?
Submit jobs through QMF or condor_submit to a schedd plug-in job server (QMF_PUBLISH_SUBMISSIONS=True). Use a development tool such as src/management/qmfprobe.py or submissions.py to verify that, once all the procs in a submission have gone to COMPLETED or REMOVED, the QMF submission object is no longer available.
Reproduced on condor-7.4.4-0.17.el5
Verified on condor-qmf-7.4.5-0.6.el5
$ ./submit.py amqp://cumin/cumin@localhost:5672
$ ./qmfprobe.py | grep submission
$ sleep 60
$ ./qmfprobe.py | grep submission
Oops, now I found there's a problem. The submission for a job that is running when I do "condor_rm -all" does not get removed.

Reproducer:

condor_submit << EOF
Executable = /bin/sleep
Universe = vanilla
args = 20m
queue 1
EOF
condor_rm -all
# Now see the Submissions in Cumin->Grid

Tested on cumin-0.1.4462-1.el5
Submission objects of these already non-existing jobs are also visible with src/management/qmfprobe.py.
FH sha 8946a558e3, directly in V7_4-QMF-branch. The proc count check needed to account for negative values, since late ad updates on removed jobs could drive the count below zero.
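Continuing the hypothetical sketch from the earlier comment, the guard described here would replace the naive decrement: a late classad update on an already-removed job can fire after the count has hit zero, so the detach path clamps at zero instead of going negative:

// Revised detach logic for the earlier sketch (names still hypothetical):
// late ad updates on removed jobs may arrive after the count already
// reached zero, so clamp rather than let the counter go negative.
bool SubmissionObject::detachJob() {
    if (m_procs > 0) {
        --m_procs;
    }
    return m_procs == 0;  // destroy the QMF object only at exactly zero
}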
I will re-check as soon as there are new packages containing the change mentioned in comment #15. Thank you.
More refactoring is needed. We can't rely on proc counts derived from the UNEXPANDED state if the schedd is restarted while a job is still running in the job queue.
FH sha f77bdd6. Unit testing looks good:
- counts are correct across restarts of the schedd
- completion and removal clean up the submission object

NB: completed/removed counts will zero out for a submission if the schedd is restarted and the cluster hasn't completely finished.
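One way to read the restart issue, continuing the same hypothetical sketch: rather than trusting UNEXPANDED-state transitions observed before a restart, the registry can rebuild its counts from the jobs actually present in the queue at startup. The rebuildOnRestart method and its parameter are assumptions for illustration only:

#include <vector>

// Hypothetical startup pass: the caller passes one submission name per
// job ad still live in the job queue after the schedd restart.
void SubmissionRegistry::rebuildOnRestart(
        const std::vector<std::string> &liveJobSubmissions) {
    // Drop stale pre-restart objects and counts.
    for (std::map<std::string, SubmissionObject *>::iterator it =
             m_submissions.begin(); it != m_submissions.end(); ++it) {
        delete it->second;
    }
    m_submissions.clear();

    // Recount from the live queue. Jobs that completed or were removed
    // before the restart are never seen here, which matches the NB above:
    // completed/removed counts zero out for an unfinished cluster.
    for (size_t i = 0; i < liveJobSubmissions.size(); ++i) {
        onJobAttached(liveJobSubmissions[i]);
    }
}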
Verified on RHEL5 x86_64 with condor-qmf-7.4.5-0.7.el5; I will check the RHEL4 and i386 packages today.
Oops, the running-jobs count is still wrong in the Grid tab (where the pools are listed) and in the Statistics section of the Overview.
Could that be a Cumin bug (cumin-0.1.4478-1.el5)? If so, then condor-qmf-7.4.5-0.7 seems to be fine on all architectures, including RHEL4.
The wrong running-jobs count seems to go away after some time... I am investigating further now.
By rough measurement, it takes about 5 minutes before the wrong statistics are gone. I am testing with both QMF_UPDATE_INTERVAL and COLLECTOR_UPDATE_INTERVAL set to 1 (minute, AFAIK), qpidd's mgmt-pub-interval set to 5 (seconds), and Cumin's page refresh set to 5 seconds as well.
(In reply to comment #25)
> By rough measurement, it takes about 5 minutes before the
> wrong statistics are gone.
>
> I am testing with both QMF_UPDATE_INTERVAL and
> COLLECTOR_UPDATE_INTERVAL set to 1 (minute, AFAIK),
> qpidd's mgmt-pub-interval set to 5 (seconds),
> and Cumin's page refresh set to 5 seconds as well.

I reproduced this behavior using the packages above. It appears that on a job removal, the QMF object for the Collector will not contain an updated job count for 5 minutes. Cumin sees update messages on the given interval, and sample data is logged in the database, but the job count does not change if the system is left idle until the 5-minute mark has been passed. It seems that if a new job is submitted, however, the job count will be corrected shortly thereafter. This does not seem to be a Cumin bug -- the value in the QMF data appears to be incorrect.
> It seems that if a new job is submitted, however, the job count will be
> corrected shortly thereafter.

Forgot to mention: assuming this is true, the implication is that on a grid with regular job submissions this bug would probably go unnoticed. However, the root cause should be understood.
It's not a bug; it's the interval the Schedd uses to publish information about itself, which is then reflected into the QMF object space:

SCHEDD_INTERVAL
http://www.cs.wisc.edu/condor/manual/v7.5/3_3Configuration.html#17874
(In reply to comment #28)
> It's not a bug; it's the interval the Schedd uses to publish information
> about itself, which is then reflected into the QMF object space:
>
> SCHEDD_INTERVAL
>
> http://www.cs.wisc.edu/condor/manual/v7.5/3_3Configuration.html#17874

Thanks, that explains the 5 minutes. But is it by design that a job submission causes the schedd to publish, but a job removal does not? This is what I'm observing. It seems inconsistent to me.
It's possible. There are a number of things that may prevent the Schedd from sending an update to the collector (and QMF object). That should be a separate BZ.
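For testing, a hedged condor_config sketch that shortens the schedd's self-publish interval so removals show up sooner; the 300-second default is what produces the 5-minute lag discussed above (the other knobs are the ones already mentioned in this thread, and exact units/defaults may vary by version):

# condor_config -- shorten publish intervals while testing
SCHEDD_INTERVAL = 60              # seconds; default 300 = the 5-minute lag
QMF_UPDATE_INTERVAL = 1           # as used in comment #25
COLLECTOR_UPDATE_INTERVAL = 1     # as used in comment #25

# qpidd.conf
mgmt-pub-interval=5               # seconds between QMF publishes from the broker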
Verified on
condor-qmf-7.4.5-0.7.el5
condor-qmf-7.4.5-0.7.el4
on both i386 and x86_64. The RHEL4 machines had QMF_BROKER_HOST set to a RHEL5 box with a running Qpid; checked with qmfprobe.py.

The new bug spun off from the comments above is Bug 673179.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1 @@
-Cause: no ref counting of jobs against submissions in the job server plug-in
-Consequence: the management console would see submissions but was unable to retrieve any job summaries from them
-Fix: added ref counting as jobs are attached to submissions
-Result: QMF submission objects are "cleaned up" once all their jobs have been completed or removed
+Schedd QMF left the Submission QMF objects hanging around after all of a submission's jobs had finished. This happened due to the absence of ref counting of jobs on submissions in the Schedd QMF plug-in. With this update, ref counting is performed as each job is attached to a submission, and QMF submission objects are deleted once all their jobs are completed or removed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html