Bug 840076

Summary: Job history collection daemon and tool
Product: Red Hat Enterprise MRG
Reporter: Pete MacKinnon <pmackinn>
Component: condor-plumage
Assignee: Pete MacKinnon <pmackinn>
Status: CLOSED ERRATA
QA Contact: Daniel Horák <dahorak>
Severity: medium
Docs Contact:
Priority: high
Version: Development
CC: dahorak, dryan, iboverma, ltoscano, matt, mkudlej, rrati, tstclair
Target Milestone: 2.3
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: condor-7.8.8-0.2
Doc Type: Enhancement
Doc Text:
Feature: Introduce collection and storage of job history into mongodb, including a standalone python client used to query the job history. This does not replace the existing history file infrastructure or tools like condor_history.
Reason: Move job history onto a more robust, scalable and manageable backend data source for enterprise deployments.
Result (if any): Completed
Story Points: ---
Clone Of:
: 876834 (view as bug list) Environment:
Last Closed: 2013-03-06 18:44:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 850563, 876834    

Description Pete MacKinnon 2012-07-13 15:50:15 UTC
Introduce collection and storage of job history into mongodb (and other backends potentially), thereby replacing the existing flat file system. This is a first step to migrating the Aviary query server to provide job data from the Plumage ODS (not included in this scope).

A standalone python client will also be used to query the job history.

Comment 2 Martin Kudlej 2012-08-01 10:08:13 UTC
How can we test these new tools and the collection? What functionality do they cover?
We need a more detailed description of what will be new in plumage and how many new tools this change will introduce, so that we can plan our testing and potential test automation.

Comment 3 Pete MacKinnon 2012-08-01 14:11:52 UTC
The new tool will be analogous to the existing condor_history tool:

http://research.cs.wisc.edu/condor/manual/v7.8/condor_history.html

The output data fields will likely be identical, but the arguments will relate to the mongod server connection rather than to the various existing file and directory arguments.

The output should include:
ID
    The cluster/process id of the job. 
OWNER
    The owner of the job. 
SUBMITTED
    The month, day, hour, and minute the job was submitted to the queue. 
RUN_TIME
    Remote wall clock time accumulated by the job to date in days, hours, minutes, and seconds.
ST
    Completion status of the job (C = completed and X = removed). 
COMPLETED
    The time the job was completed. 
CMD
    The name of the executable.
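
As an illustration of the fields above, here is a minimal sketch that maps a job ad (as a plain dictionary) onto the summary columns. The attribute names (ClusterId, ProcId, Owner, QDate, RemoteWallClockTime, JobStatus, CompletionDate, Cmd) are standard Condor job ad attributes; the helper names, the days+HH:MM:SS layout, and rendering times in UTC are assumptions made for the example, not the tool's actual implementation.

```python
from datetime import datetime, timezone

# Condor JobStatus codes that can appear in history: 3 = removed, 4 = completed.
STATUS_LETTERS = {3: "X", 4: "C"}


def format_run_time(seconds):
    """Render wall clock seconds as days+HH:MM:SS (layout is an assumption)."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return "%d+%02d:%02d:%02d" % (days, hours, minutes, secs)


def format_time(epoch):
    """Render an epoch timestamp as month/day hour:minute, in UTC for determinism."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%m/%d %H:%M")


def summary_row(ad):
    """Build the summary columns from one job ad dictionary (hypothetical helper)."""
    return {
        "ID": "%d.%d" % (ad["ClusterId"], ad["ProcId"]),
        "OWNER": ad["Owner"],
        "SUBMITTED": format_time(ad["QDate"]),
        "RUN_TIME": format_run_time(ad["RemoteWallClockTime"]),
        "ST": STATUS_LETTERS.get(ad["JobStatus"], "?"),
        "COMPLETED": format_time(ad["CompletionDate"]),
        "CMD": ad["Cmd"],
    }
```

Any status code other than 3 or 4 is rendered as "?" here, since only completed and removed jobs are expected in history.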

Wisdom on testing the new backend (using the proposed tool or ad-hoc mongodb clients):

1) Are any COMPLETED or REMOVED jobs missing?
2) Is the history job data accurate and complete?
3) Can I list jobs in forward and reverse chronological order?
4) Can I limit the output to a specified page size?
5) Can I see a long listing (i.e., all job attributes) for a job?
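
Several of these checks can be scripted once the job ads have been fetched by whatever client is used. The following is only a sketch under stated assumptions: the ads are plain dictionaries, CompletionDate is used as the ordering key, and all function names are hypothetical.

```python
def job_id(ad):
    """Return the cluster.proc identifier of a job ad."""
    return "%d.%d" % (ad["ClusterId"], ad["ProcId"])


def missing_jobs(expected_ids, ads):
    """Check 1: IDs that were expected in history but are not among the ads."""
    found = {job_id(ad) for ad in ads}
    return sorted(set(expected_ids) - found)


def chronological(ads, forward=True):
    """Check 3: order ads by CompletionDate, oldest first when forward is True."""
    return sorted(ads, key=lambda ad: ad["CompletionDate"], reverse=not forward)


def page(ads, size):
    """Check 4: limit the listing to at most `size` entries."""
    return ads[:size]
```

For example, `page(chronological(ads, forward=False), 10)` would give the ten most recently completed jobs. Checking accuracy (item 2) still requires comparing attribute values against a trusted source such as condor_history -l.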

Comment 4 Pete MacKinnon 2012-08-01 14:13:15 UTC
I'll revise comment #1 to say that the tool may not necessarily be implemented in python (pymongo).

Comment 5 Pete MacKinnon 2012-08-01 14:13:51 UTC
Or comment #0 even...

Comment 6 Martin Kudlej 2012-08-01 15:26:39 UTC
If those tools are implemented in a language other than python, can you please check their new dependencies in RHEL? That would avoid delays before the packages are handed over to us. Thanks.

Comment 7 Pete MacKinnon 2012-08-01 20:15:08 UTC
Of course. It *may* be implemented in C/C++ which would be constrained to existing Condor devel dependencies plus mongodb-devel (all current deps). As with the other Plumage components, this would be available on RHEL6 only.

Comment 8 Daniel Horák 2012-11-07 13:01:59 UTC
The help for plumage_history mentions a "--sub=SUB" parameter.
Internally this parameter is used to filter on the "Submission" attribute in mongodb, but no such attribute is present in the database, and 'condor_history -l' does not know it either.

What exactly does this parameter do?

Comment 9 Pete MacKinnon 2012-11-07 14:05:30 UTC
This parameter returns all jobs that have a "Submission" attribute that matches the supplied argument. I'll concede that a non-Aviary, non-QMF deployment would not generate these but Plumage is part of the Grid ecosystem. Obviously a similar CLI arg doesn't exist in the legacy condor_history tool.
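
To make the described behaviour concrete, here is a small sketch of the filter over job ads held as dictionaries, together with the equivalent MongoDB query document. Treating "matches" as exact equality is an assumption, and both function names are hypothetical.

```python
def submission_query(name):
    """MongoDB query document selecting ads whose Submission equals `name`."""
    return {"Submission": name}


def filter_by_submission(ads, name):
    """Apply the same selection to ads already held in memory.

    Ads without a Submission attribute (for example from a non-Aviary,
    non-QMF deployment) never match.
    """
    return [ad for ad in ads if ad.get("Submission") == name]
```

With ads that carry no Submission attribute at all, the filter simply returns an empty list, which matches the case described for deployments that do not generate it.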

Comment 19 Daniel Horák 2013-01-16 13:26:02 UTC
Tested and verified on RHEL 6.4 i386/x86_64 with:
# rpm -qa | grep -e condor -e plumage -e mongo | sort
  condor-7.8.8-0.3.el6.x86_64
  condor-aviary-7.8.8-0.3.el6.x86_64
  condor-classads-7.8.8-0.3.el6.x86_64
  condor-plumage-7.8.8-0.3.el6.x86_64
  condor-qmf-7.8.8-0.3.el6.x86_64
  mongodb-1.6.4-4.el6.x86_64
  mongodb-server-1.6.4-4.el6.x86_64
  pymongo-1.9-8.el6.x86_64

Job history is correctly collected to mongodb.
Parameters work as expected (with one exception [1]) and data are consistent with condor_history:

# plumage_history
  ID      OWNER            SUBMITTED                   RUN_TIME   ST  COMPLETED/REMOVED           CMD
  2.4     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:23:26 2013    /bin/sleep 20
  2.3     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.2     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.1     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.0     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:05 2013    /bin/sleep 20
  1.4     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:45 2013    /bin/sleep 20
  3.4     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:45 2013    /bin/sleep 20
  1.3     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:25 2013    /bin/sleep 20
  3.3     test3            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:22:25 2013    /bin/sleep 20
  4.4     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:24 2013    /bin/sleep 20
  4.3     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:24 2013    /bin/sleep 20
  1.2     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:22:05 2013    /bin/sleep 20
  3.2     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:04 2013    /bin/sleep 20
  4.2     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:04 2013    /bin/sleep 20
  1.1     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  3.1     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  4.1     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  3.0     test3            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  1.0     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  4.0     test1            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20

# plumage_history -s HOST
  ID      OWNER            SUBMITTED                   RUN_TIME   ST  COMPLETED/REMOVED           CMD
  2.4     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:23:26 2013    /bin/sleep 20
  2.3     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.2     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.1     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.0     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:05 2013    /bin/sleep 20
  1.4     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:45 2013    /bin/sleep 20
  3.4     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:45 2013    /bin/sleep 20
  1.3     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:25 2013    /bin/sleep 20
  3.3     test3            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:22:25 2013    /bin/sleep 20
  4.4     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:24 2013    /bin/sleep 20
  4.3     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:24 2013    /bin/sleep 20
  1.2     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:22:05 2013    /bin/sleep 20
  3.2     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:04 2013    /bin/sleep 20
  4.2     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:04 2013    /bin/sleep 20
  1.1     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  3.1     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  4.1     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  3.0     test3            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  1.0     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  4.0     test1            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20

# plumage_history -f
  ID      OWNER            SUBMITTED                   RUN_TIME   ST  COMPLETED/REMOVED           CMD
  4.0     test1            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  1.0     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  3.0     test3            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  4.1     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  3.1     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  1.1     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  4.2     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:04 2013    /bin/sleep 20
  3.2     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:04 2013    /bin/sleep 20
  1.2     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:22:05 2013    /bin/sleep 20
  4.3     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:24 2013    /bin/sleep 20
  4.4     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:24 2013    /bin/sleep 20
  3.3     test3            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:22:25 2013    /bin/sleep 20
  1.3     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:25 2013    /bin/sleep 20
  3.4     test3            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:45 2013    /bin/sleep 20
  1.4     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:45 2013    /bin/sleep 20
  2.0     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:05 2013    /bin/sleep 20
  2.1     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.2     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.3     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:23:24 2013    /bin/sleep 20
  2.4     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:23:26 2013    /bin/sleep 20

# plumage_history -c 1
  ID      OWNER            SUBMITTED                   RUN_TIME   ST  COMPLETED/REMOVED           CMD
  1.0     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  1.1     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  1.2     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:22:05 2013    /bin/sleep 20
  1.3     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:25 2013    /bin/sleep 20
  1.4     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:45 2013    /bin/sleep 20

# plumage_history -o test1
  ID      OWNER            SUBMITTED                   RUN_TIME   ST  COMPLETED/REMOVED           CMD
  4.0     test1            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  4.1     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  4.2     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:04 2013    /bin/sleep 20
  4.3     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:24 2013    /bin/sleep 20
  4.4     test1            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:24 2013    /bin/sleep 20

# plumage_history -S 'HOST#1'
  ID      OWNER            SUBMITTED                   RUN_TIME   ST  COMPLETED/REMOVED           CMD
  1.0     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:21:24 2013    /bin/sleep 20
  1.1     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:21:44 2013    /bin/sleep 20
  1.2     test5            Wed Jan 16 11:20:51 2013    00:00:21   C   Wed Jan 16 11:22:05 2013    /bin/sleep 20
  1.3     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:25 2013    /bin/sleep 20
  1.4     test5            Wed Jan 16 11:20:51 2013    00:00:20   C   Wed Jan 16 11:22:45 2013    /bin/sleep 20

# plumage_history -l
  Args = 20
  BufferBlockSize = 32768
  BufferSize = 524288
  BytesRecvd = 0.0
  BytesSent = 0.0
  ClusterId = 2
  Cmd = /bin/sleep
    << truncated >>
  WantCheckpoint = False
  WantRemoteIO = True
  WantRemoteSyscalls = False


# plumage_history -l | grep -i globaljobid
  GlobalJobId = HOST#2.4#1358331651
  GlobalJobId = HOST#2.3#1358331651
  GlobalJobId = HOST#2.2#1358331651
  GlobalJobId = HOST#2.1#1358331651
  GlobalJobId = HOST#2.0#1358331651
  GlobalJobId = HOST#1.4#1358331651
  GlobalJobId = HOST#3.4#1358331651
  GlobalJobId = HOST#1.3#1358331651
  GlobalJobId = HOST#3.3#1358331651
  GlobalJobId = HOST#4.4#1358331651
  GlobalJobId = HOST#4.3#1358331651
  GlobalJobId = HOST#1.2#1358331651
  GlobalJobId = HOST#3.2#1358331651
  GlobalJobId = HOST#4.2#1358331651
  GlobalJobId = HOST#1.1#1358331651
  GlobalJobId = HOST#3.1#1358331651
  GlobalJobId = HOST#4.1#1358331651
  GlobalJobId = HOST#3.0#1358331651
  GlobalJobId = HOST#1.0#1358331651
  GlobalJobId = HOST#4.0#1358331651

# plumage_history -l -f | grep -i globaljobid
  GlobalJobId = HOST#4.0#1358331651
  GlobalJobId = HOST#1.0#1358331651
  GlobalJobId = HOST#3.0#1358331651
  GlobalJobId = HOST#4.1#1358331651
  GlobalJobId = HOST#3.1#1358331651
  GlobalJobId = HOST#1.1#1358331651
  GlobalJobId = HOST#4.2#1358331651
  GlobalJobId = HOST#3.2#1358331651
  GlobalJobId = HOST#1.2#1358331651
  GlobalJobId = HOST#4.3#1358331651
  GlobalJobId = HOST#4.4#1358331651
  GlobalJobId = HOST#3.3#1358331651
  GlobalJobId = HOST#1.3#1358331651
  GlobalJobId = HOST#3.4#1358331651
  GlobalJobId = HOST#1.4#1358331651
  GlobalJobId = HOST#2.0#1358331651
  GlobalJobId = HOST#2.1#1358331651
  GlobalJobId = HOST#2.2#1358331651
  GlobalJobId = HOST#2.3#1358331651
  GlobalJobId = HOST#2.4#1358331651

# plumage_history -l -c 1 | grep -i globaljobid
  GlobalJobId = HOST#1.0#1358331651
  GlobalJobId = HOST#1.1#1358331651
  GlobalJobId = HOST#1.2#1358331651
  GlobalJobId = HOST#1.3#1358331651
  GlobalJobId = HOST#1.4#1358331651

# plumage_history -l -o test1 | grep -i globaljobid
  GlobalJobId = HOST#4.0#1358331651
  GlobalJobId = HOST#4.1#1358331651
  GlobalJobId = HOST#4.2#1358331651
  GlobalJobId = HOST#4.3#1358331651
  GlobalJobId = HOST#4.4#1358331651

# plumage_history -l -S 'HOST#1'  | grep -i globaljobid
  GlobalJobId = HOST#1.0#1358331651
  GlobalJobId = HOST#1.1#1358331651
  GlobalJobId = HOST#1.2#1358331651
  GlobalJobId = HOST#1.3#1358331651
  GlobalJobId = HOST#1.4#1358331651
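
The ordering comparison between the default and -f listings above can be automated. This is a sketch only: it assumes the GlobalJobId lines have the host#cluster.proc#timestamp shape shown in the output above, and the function names are hypothetical.

```python
import re

# Matches lines such as "  GlobalJobId = HOST#2.4#1358331651".
GLOBAL_JOB_ID = re.compile(
    r"^\s*GlobalJobId = (?P<host>[^#]+)#(?P<job>\d+\.\d+)#(?P<qdate>\d+)\s*$"
)


def job_ids(output):
    """Extract cluster.proc IDs, in order, from grep'ed GlobalJobId lines."""
    ids = []
    for line in output.splitlines():
        match = GLOBAL_JOB_ID.match(line)
        if match:
            ids.append(match.group("job"))
    return ids


def is_exact_reverse(default_output, forward_output):
    """True when the -f listing is the default listing in reverse order."""
    return job_ids(default_output) == job_ids(forward_output)[::-1]
```

Feeding it the captured output of `plumage_history -l | grep -i globaljobid` and of the same command with -f gives a single pass/fail result instead of a visual comparison.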


I also checked running the mongodb server on a different machine than condor, and it works correctly.

When the mongodb size limit is reached on a 32-bit system, the message in JobEtlLog correctly describes the problem:
# tail -F /var/log/condor/JobEtlLog 
    ...
  01/16/13 11:26:57 mongodb getLastError: can't map file memory - mongo requires 64 bit build for larger datasets
  01/16/13 11:26:57 ODSHistoryFile::poll: unable to write history job ad to ODS for 'JOBID1'
  01/16/13 11:26:57 mongodb getLastError: can't map file memory - mongo requires 64 bit build for larger datasets
  01/16/13 11:26:57 ODSHistoryFile::poll: unable to write history job ad to ODS for 'JOBID2'
  01/16/13 11:26:57 mongodb getLastError: can't map file memory - mongo requires 64 bit build for larger datasets
  01/16/13 11:26:57 ODSHistoryFile::poll: unable to write history job ad to ODS for 'JOBID3'
    ...
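
To find out which history ads were not written, the log can be scanned for the failure message. A minimal sketch, assuming the message text stays exactly as in the excerpt above; the function name is hypothetical.

```python
import re

# Message emitted when a history job ad could not be written to the ODS.
WRITE_FAILURE = re.compile(
    r"ODSHistoryFile::poll: unable to write history job ad to ODS for '([^']*)'"
)


def failed_job_ids(log_lines):
    """Return the job IDs named in write-failure messages, in log order."""
    failed = []
    for line in log_lines:
        match = WRITE_FAILURE.search(line)
        if match:
            failed.append(match.group(1))
    return failed
```

It could be applied to an open file, for example `failed_job_ids(open("/var/log/condor/JobEtlLog"))`, since iterating a file yields its lines.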


[1] Bug 895985 - plumage_history: parameter --forward doesn't work with some other parameters
[2] Bug 888706 - RFE: Authentication and authorization for Plumage, Bug 892767 - Harden mongodb-server default bind_ip

Comment 21 errata-xmlrpc 2013-03-06 18:44:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html