Bug 840076
Summary: | Job history collection daemon and tool | |||
---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Pete MacKinnon <pmackinn> | |
Component: | condor-plumage | Assignee: | Pete MacKinnon <pmackinn> | |
Status: | CLOSED ERRATA | QA Contact: | Daniel Horák <dahorak> | |
Severity: | medium | Docs Contact: | ||
Priority: | high | |||
Version: | Development | CC: | dahorak, dryan, iboverma, ltoscano, matt, mkudlej, rrati, tstclair | |
Target Milestone: | 2.3 | Keywords: | FutureFeature | |
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | condor-7.8.8-0.2 | Doc Type: | Enhancement | |
Doc Text: |
Feature: Introduce collection and storage of job history in mongodb, including a standalone Python client used to query the job history. This does not replace the existing history file infrastructure or tools like condor_history.
Reason: Move job history onto a more robust, scalable and manageable backend data source for enterprise deployments.
Result (if any): Completed
|
Story Points: | --- | |
Clone Of: | ||||
: | 876834 | Environment: | |
Last Closed: | 2013-03-06 18:44:36 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 850563, 876834 |
Description
Pete MacKinnon
2012-07-13 15:50:15 UTC
How can we test these new tools and the collection? What functionality do they cover? We need a more detailed description of what will be new in Plumage and how many new tools this change will introduce, so that we can plan our testing and potential test automation.

The new tool will be analogous to the existing condor_history tool:

http://research.cs.wisc.edu/condor/manual/v7.8/condor_history.html

Although the output data fields will likely be identical, there will be arguments that relate to the mongod server connection as opposed to the various existing file and directory arguments.

The output should include:

ID         The cluster/process id of the job.
OWNER      The owner of the job.
SUBMITTED  The month, day, hour, and minute the job was submitted to the queue.
RUN_TIME   Remote wall clock time accumulated by the job to date in days, hours, minutes, and seconds.
ST         Completion status of the job (C = completed and X = removed).
COMPLETED  The time the job was completed.
CMD        The name of the executable.

Wisdom on testing the new backend (using the proposed tool or ad-hoc mongodb clients):

1) Are any COMPLETED or REMOVED jobs missing?
2) Is the history job data accurate and complete?
3) Can I list jobs in forward and reverse chronological order?
4) Can I limit the output to a specified page size?
5) Can I see a long listing (i.e., all job attributes) for a job?

I'll revise comment #1 to say that the tool may not necessarily be implemented in Python (pymongo).

Or comment #0 even...

If those tools are implemented in a language other than Python, can you please check their new dependencies in RHEL? That can save time before the packages are handed over to us. Thanks.

Of course. It *may* be implemented in C/C++, which would be constrained to the existing Condor devel dependencies plus mongodb-devel (all current deps). As with the other Plumage components, this would be available on RHEL6 only.

The help for plumage_history mentions a "--sub=SUB" parameter.
Internally, this parameter is used to filter on the "Submission" attribute in mongodb, but no such attribute is present in the database, and 'condor_history -l' doesn't know it either. What exactly does this parameter do?

This parameter returns all jobs that have a "Submission" attribute that matches the supplied argument. I'll concede that a non-Aviary, non-QMF deployment would not generate these, but Plumage is part of the Grid ecosystem. Obviously a similar CLI arg doesn't exist in the legacy condor_history tool.

Tested and verified on RHEL 6.4 i386/x86_64 with:

# rpm -qa | grep -e condor -e plumage -e mongo | sort
condor-7.8.8-0.3.el6.x86_64
condor-aviary-7.8.8-0.3.el6.x86_64
condor-classads-7.8.8-0.3.el6.x86_64
condor-plumage-7.8.8-0.3.el6.x86_64
condor-qmf-7.8.8-0.3.el6.x86_64
mongodb-1.6.4-4.el6.x86_64
mongodb-server-1.6.4-4.el6.x86_64
pymongo-1.9-8.el6.x86_64

Job history is correctly collected to mongodb. Parameters work as expected (with one exception [1]) and data are consistent with condor_history:

# plumage_history
ID   OWNER  SUBMITTED                 RUN_TIME  ST  COMPLETED/REMOVED         CMD
2.4  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:23:26 2013  /bin/sleep 20
2.3  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.2  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.1  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.0  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:05 2013  /bin/sleep 20
1.4  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:45 2013  /bin/sleep 20
3.4  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:45 2013  /bin/sleep 20
1.3  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:25 2013  /bin/sleep 20
3.3  test3  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:22:25 2013  /bin/sleep 20
4.4  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:24 2013  /bin/sleep 20
4.3  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:24 2013  /bin/sleep 20
1.2  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:22:05 2013  /bin/sleep 20
3.2  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:04 2013  /bin/sleep 20
4.2  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:04 2013  /bin/sleep 20
1.1  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
3.1  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
4.1  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
3.0  test3  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
1.0  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
4.0  test1  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20

# plumage_history -s HOST
ID   OWNER  SUBMITTED                 RUN_TIME  ST  COMPLETED/REMOVED         CMD
2.4  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:23:26 2013  /bin/sleep 20
2.3  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.2  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.1  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.0  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:05 2013  /bin/sleep 20
1.4  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:45 2013  /bin/sleep 20
3.4  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:45 2013  /bin/sleep 20
1.3  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:25 2013  /bin/sleep 20
3.3  test3  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:22:25 2013  /bin/sleep 20
4.4  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:24 2013  /bin/sleep 20
4.3  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:24 2013  /bin/sleep 20
1.2  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:22:05 2013  /bin/sleep 20
3.2  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:04 2013  /bin/sleep 20
4.2  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:04 2013  /bin/sleep 20
1.1  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
3.1  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
4.1  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
3.0  test3  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
1.0  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
4.0  test1  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20

# plumage_history -f
ID   OWNER  SUBMITTED                 RUN_TIME  ST  COMPLETED/REMOVED         CMD
4.0  test1  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
1.0  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
3.0  test3  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
4.1  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
3.1  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
1.1  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
4.2  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:04 2013  /bin/sleep 20
3.2  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:04 2013  /bin/sleep 20
1.2  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:22:05 2013  /bin/sleep 20
4.3  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:24 2013  /bin/sleep 20
4.4  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:24 2013  /bin/sleep 20
3.3  test3  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:22:25 2013  /bin/sleep 20
1.3  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:25 2013  /bin/sleep 20
3.4  test3  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:45 2013  /bin/sleep 20
1.4  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:45 2013  /bin/sleep 20
2.0  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:05 2013  /bin/sleep 20
2.1  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.2  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.3  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:23:24 2013  /bin/sleep 20
2.4  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:23:26 2013  /bin/sleep 20

# plumage_history -c 1
ID   OWNER  SUBMITTED                 RUN_TIME  ST  COMPLETED/REMOVED         CMD
1.0  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
1.1  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
1.2  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:22:05 2013  /bin/sleep 20
1.3  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:25 2013  /bin/sleep 20
1.4  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:45 2013  /bin/sleep 20

# plumage_history -o test1
ID   OWNER  SUBMITTED                 RUN_TIME  ST  COMPLETED/REMOVED         CMD
4.0  test1  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
4.1  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
4.2  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:04 2013  /bin/sleep 20
4.3  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:24 2013  /bin/sleep 20
4.4  test1  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:24 2013  /bin/sleep 20

# plumage_history -S 'HOST#1'
ID   OWNER  SUBMITTED                 RUN_TIME  ST  COMPLETED/REMOVED         CMD
1.0  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:21:24 2013  /bin/sleep 20
1.1  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:21:44 2013  /bin/sleep 20
1.2  test5  Wed Jan 16 11:20:51 2013  00:00:21  C   Wed Jan 16 11:22:05 2013  /bin/sleep 20
1.3  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:25 2013  /bin/sleep 20
1.4  test5  Wed Jan 16 11:20:51 2013  00:00:20  C   Wed Jan 16 11:22:45 2013  /bin/sleep 20

# plumage_history -l
Args = 20
BufferBlockSize = 32768
BufferSize = 524288
BytesRecvd = 0.0
BytesSent = 0.0
ClusterId = 2
Cmd = /bin/sleep
<< truncated >>
WantCheckpoint = False
WantRemoteIO = True
WantRemoteSyscalls = False

# plumage_history -l | grep -i globaljobid
GlobalJobId = HOST#2.4#1358331651
GlobalJobId = HOST#2.3#1358331651
GlobalJobId = HOST#2.2#1358331651
GlobalJobId = HOST#2.1#1358331651
GlobalJobId = HOST#2.0#1358331651
GlobalJobId = HOST#1.4#1358331651
GlobalJobId = HOST#3.4#1358331651
GlobalJobId = HOST#1.3#1358331651
GlobalJobId = HOST#3.3#1358331651
GlobalJobId = HOST#4.4#1358331651
GlobalJobId = HOST#4.3#1358331651
GlobalJobId = HOST#1.2#1358331651
GlobalJobId = HOST#3.2#1358331651
GlobalJobId = HOST#4.2#1358331651
GlobalJobId = HOST#1.1#1358331651
GlobalJobId = HOST#3.1#1358331651
GlobalJobId = HOST#4.1#1358331651
GlobalJobId = HOST#3.0#1358331651
GlobalJobId = HOST#1.0#1358331651
GlobalJobId = HOST#4.0#1358331651

# plumage_history -l -f | grep -i globaljobid
GlobalJobId = HOST#4.0#1358331651
GlobalJobId = HOST#1.0#1358331651
GlobalJobId = HOST#3.0#1358331651
GlobalJobId = HOST#4.1#1358331651
GlobalJobId = HOST#3.1#1358331651
GlobalJobId = HOST#1.1#1358331651
GlobalJobId = HOST#4.2#1358331651
GlobalJobId = HOST#3.2#1358331651
GlobalJobId = HOST#1.2#1358331651
GlobalJobId = HOST#4.3#1358331651
GlobalJobId = HOST#4.4#1358331651
GlobalJobId = HOST#3.3#1358331651
GlobalJobId = HOST#1.3#1358331651
GlobalJobId = HOST#3.4#1358331651
GlobalJobId = HOST#1.4#1358331651
GlobalJobId = HOST#2.0#1358331651
GlobalJobId = HOST#2.1#1358331651
GlobalJobId = HOST#2.2#1358331651
GlobalJobId = HOST#2.3#1358331651
GlobalJobId = HOST#2.4#1358331651

# plumage_history -l -c 1 | grep -i globaljobid
GlobalJobId = HOST#1.0#1358331651
GlobalJobId = HOST#1.1#1358331651
GlobalJobId = HOST#1.2#1358331651
GlobalJobId = HOST#1.3#1358331651
GlobalJobId = HOST#1.4#1358331651

# plumage_history -l -o test1 | grep -i globaljobid
GlobalJobId = HOST#4.0#1358331651
GlobalJobId = HOST#4.1#1358331651
GlobalJobId = HOST#4.2#1358331651
GlobalJobId = HOST#4.3#1358331651
GlobalJobId = HOST#4.4#1358331651

# plumage_history -l -S 'HOST#1' | grep -i globaljobid
GlobalJobId = HOST#1.0#1358331651
GlobalJobId = HOST#1.1#1358331651
GlobalJobId = HOST#1.2#1358331651
GlobalJobId = HOST#1.3#1358331651
GlobalJobId = HOST#1.4#1358331651

I also checked running the mongodb server on a different machine than condor, and it works correctly.

When the mongodb size limit is reached on a 32-bit system, the message in JobEtlLog correctly describes the problem:

# tail -F /var/log/condor/JobEtlLog
...
01/16/13 11:26:57 mongodb getLastError: can't map file memory - mongo requires 64 bit build for larger datasets
01/16/13 11:26:57 ODSHistoryFile::poll: unable to write history job ad to ODS for 'JOBID1'
01/16/13 11:26:57 mongodb getLastError: can't map file memory - mongo requires 64 bit build for larger datasets
01/16/13 11:26:57 ODSHistoryFile::poll: unable to write history job ad to ODS for 'JOBID2'
01/16/13 11:26:57 mongodb getLastError: can't map file memory - mongo requires 64 bit build for larger datasets
01/16/13 11:26:57 ODSHistoryFile::poll: unable to write history job ad to ODS for 'JOBID3'
...

[1] Bug 895985 - plumage_history: parameter --forward doesn't work with some other parameters
[2] Bug 888706 - RFE: Authentication and authorization for Plumage, Bug 892767 - Harden mongodb-server default bind_ip

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html
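
Testing questions 1) and 3) above (missing jobs and forward/reverse ordering) lend themselves to automation. The following is only a minimal sketch, assuming the short listing format shown in this report; the helper names (parse_rows, is_ordered, missing_ids) are hypothetical and are not part of Plumage or Condor. It parses captured listing text rather than invoking plumage_history or querying mongodb directly.

```python
from datetime import datetime

# Assumption: timestamps look like "Wed Jan 16 11:20:51 2013", as in the
# listings in this report. %a and %b are locale dependent, so this expects
# a C/English locale.
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

# A few rows copied from the "plumage_history -f" output in this report.
FORWARD_LISTING = """\
ID OWNER SUBMITTED RUN_TIME ST COMPLETED/REMOVED CMD
4.0 test1 Wed Jan 16 11:20:51 2013 00:00:21 C Wed Jan 16 11:21:24 2013 /bin/sleep 20
1.0 test5 Wed Jan 16 11:20:51 2013 00:00:21 C Wed Jan 16 11:21:24 2013 /bin/sleep 20
4.1 test1 Wed Jan 16 11:20:51 2013 00:00:20 C Wed Jan 16 11:21:44 2013 /bin/sleep 20
1.2 test5 Wed Jan 16 11:20:51 2013 00:00:21 C Wed Jan 16 11:22:05 2013 /bin/sleep 20
"""


def parse_rows(listing):
    """Turn short-listing text into a list of dicts, skipping the header."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # A data row has at least 15 whitespace-separated fields:
        # id, owner, 5 for SUBMITTED, run time, status, 5 for COMPLETED, cmd.
        if len(fields) < 15:
            continue
        rows.append({
            "id": fields[0],
            "owner": fields[1],
            "submitted": datetime.strptime(" ".join(fields[2:7]), TIME_FORMAT),
            "run_time": fields[7],
            "status": fields[8],
            "completed": datetime.strptime(" ".join(fields[9:14]), TIME_FORMAT),
            "cmd": " ".join(fields[14:]),
        })
    return rows


def is_ordered(rows, forward):
    """Check that completion times are sorted ascending (forward) or descending."""
    times = [row["completed"] for row in rows]
    return times == sorted(times, reverse=not forward)


def missing_ids(reference_ids, rows):
    """Return job ids present in a reference list but absent from the listing."""
    seen = {row["id"] for row in rows}
    return sorted(set(reference_ids) - seen)


if __name__ == "__main__":
    rows = parse_rows(FORWARD_LISTING)
    print("forward order ok:", is_ordered(rows, forward=True))
    print("missing:", missing_ids(["4.0", "1.0", "3.0"], rows))
```

In practice the reference ids would come from condor_history output captured on the same host, and the listing text from plumage_history with and without -f.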