Bug 733495 - Ensure that duplicate entries of cluster.proc in history can be detected across submissions
Summary: Ensure that duplicate entries of cluster.proc in history can be detected across submissions
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: Development
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: grid-maint-list
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-08-25 19:40 UTC by Pete MacKinnon
Modified: 2016-05-26 20:14 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-26 20:14:18 UTC



Description Pete MacKinnon 2011-08-25 19:40:01 UTC
Jobs are maintained in a global lookup keyed by their cluster.proc id. These ids can periodically roll over, leading to duplicates that can be masked under the wrong submission group...

"The missing submissions appear to be ones who have a cluster+proc 
already present in the history file. Lookup of job information is on 
cluster+proc, so the first occurrence swallows subsequent occurrences.

Consider this history file...

--
Cmd = "/bin/sleep"
ClusterId = 1
QDate = 1314044143
Args = "1d"
Owner = "matt"
EnteredCurrentStatus = 1314044328
Submission = "ONE"
JobStatus = 3
GlobalJobId = "matt@eeyore#1.0#1314044148"
ProcId = 0
LastJobStatus = 5
***
Cmd = "/bin/sleep"
ClusterId = 1
QDate = 1314044143
Args = "1d"
Owner = "matt"
EnteredCurrentStatus = 1314044328
Submission = "TWO"
JobStatus = 3
GlobalJobId = "matt@eeyore#1.0#1314044148"
ProcId = 0
LastJobStatus = 5
***
--

Only submission ONE will appear. The job_server reports...

--
08/23/11 07:20:55 ProcessHistoryTimer() called
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 Job::Job of '1.0'
08/23/11 07:20:55 Created new SubmissionObject 'ONE' for 'matt'
08/23/11 07:20:55 SubmissionObject::Increment 'REMOVED' on '1.0'
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 HistoryJobImpl destroyed: key '1.0'
08/23/11 07:20:55 HistoryJobImpl added to '1.0'
--

Seem reasonable? I'd like to consider it NOTABUG - GIGO. Except that 
SCHEDD_CLUSTER_MAXIMUM_VALUE could mean duplicate cluster+proc entries in the 
history file. BZ for a future fix? Likely also important for ODS.
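
To illustrate the swallowing described above, a minimal sketch of a lookup
keyed only by cluster.proc (the HistoryEntry type and field names are
hypothetical stand-ins, not the job_server's actual structures):

--
#include <iostream>
#include <map>
#include <string>

// Hypothetical stand-in for a job_server history record.
struct HistoryEntry {
    std::string submission;   // e.g. "ONE", "TWO"
    std::string globalJobId;
};

int main() {
    // Global lookup keyed by cluster.proc, as described above.
    std::map<std::string, HistoryEntry> jobs;

    // Two history records share the same cluster.proc ("1.0") because the
    // cluster id rolled over. std::map::insert ignores the duplicate key.
    jobs.insert({"1.0", {"ONE", "matt@eeyore#1.0#1314044148"}});
    jobs.insert({"1.0", {"TWO", "matt@eeyore#1.0#1314044148"}});  // silently dropped

    // Only submission ONE survives; TWO never reaches its SubmissionObject.
    std::cout << jobs.at("1.0").submission << std::endl;  // prints "ONE"
    return 0;
}
--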

Comment 1 Pete MacKinnon 2011-08-25 19:40:25 UTC
Issue for aviary query server also.

Comment 2 Pete MacKinnon 2011-08-25 19:44:16 UTC
Possibly related to Bug 732452.

Comment 3 Pete MacKinnon 2011-10-24 22:15:57 UTC
You get other strange behavior with a shortened SCHEDD_CLUSTER_MAXIMUM_VALUE, like jobs that appear to be stuck running in the queue after a schedd restart:

-- Submitter: pmackinn@milo.usersys.redhat.com : <192.168.1.131:48084> : milo.usersys.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   pmackinn       10/24 17:40   0+00:12:43 R  0   0.0  sleep 120         
   2.0   pmackinn       10/24 17:40   0+00:12:43 R  0   0.0  sleep 120         

2 jobs; 0 idle, 2 running, 0 held

And if you try to go back from SCHEDD_CLUSTER_MAXIMUM_VALUE=3, the schedd gets jammed unless the queue is cleaned out:

10/24/11 18:13:44 (pid:13483) ERROR "JOB QUEUE DAMAGED; header ad NEXT_CLUSTER_NUM invalid" at line 1088 in file /home/pmackinn/repos/uw/condor/CONDOR_SRC/src/condor_schedd.V6/qmgmt.cpp

Comment 4 Pete MacKinnon 2011-10-25 20:19:28 UTC
Using the global job id as a key in the jobs map solves this, but that would break the current Aviary get* API since the user would have to provide the GJID instead of the simpler cluster.proc, i.e., regexp on some part of 'scheduler#c.p' and return all matches...?
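
A rough sketch of that idea: key the map on GlobalJobId and answer
cluster.proc queries with a regexp over the keys. The findByClusterProc
helper and HistoryJob type here are hypothetical, and the example assumes
the QDate component of the GJID differs across rolled-over submissions.

--
#include <iostream>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type; field names are illustrative only.
struct HistoryJob {
    std::string submission;
};

// Return every GJID ("scheduler#cluster.proc#qdate") whose cluster.proc
// portion matches the requested id.
std::vector<std::string> findByClusterProc(
        const std::map<std::string, HistoryJob>& jobsByGjid,
        const std::string& clusterProc) {
    std::vector<std::string> matches;
    // Escape the '.' in "cluster.proc" so the regex treats it literally.
    const std::string cp =
        std::regex_replace(clusterProc, std::regex("\\."), "\\.");
    const std::regex pattern("^[^#]+#" + cp + "#[0-9]+$");
    for (const auto& entry : jobsByGjid) {
        if (std::regex_match(entry.first, pattern))
            matches.push_back(entry.first);
    }
    return matches;
}

int main() {
    std::map<std::string, HistoryJob> jobsByGjid = {
        {"matt@eeyore#1.0#1314044148", {"ONE"}},
        {"matt@eeyore#1.0#1314051234", {"TWO"}},  // hypothetical later QDate
    };
    for (const auto& gjid : findByClusterProc(jobsByGjid, "1.0"))
        std::cout << gjid << " -> " << jobsByGjid.at(gjid).submission << "\n";
    return 0;
}
--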

Comment 5 Pete MacKinnon 2011-10-26 14:56:59 UTC
A multimap approach could be used to address this in the implementation, but again there is potential Aviary API impact.
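
A minimal sketch of the multimap alternative, again with a hypothetical
HistoryJob type standing in for the real implementation:

--
#include <iostream>
#include <map>
#include <string>

// Hypothetical record type; field names are illustrative only.
struct HistoryJob {
    std::string submission;
    std::string globalJobId;
};

int main() {
    // A multimap keyed by cluster.proc retains every rolled-over duplicate
    // instead of letting the first occurrence swallow the rest.
    std::multimap<std::string, HistoryJob> jobs;
    jobs.insert({"1.0", {"ONE", "matt@eeyore#1.0#1314044148"}});
    jobs.insert({"1.0", {"TWO", "matt@eeyore#1.0#1314044148"}});

    // A lookup for "1.0" now yields both entries; the Aviary get* API would
    // still need a way to present multiple matches to the caller.
    auto range = jobs.equal_range("1.0");
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->first << " -> " << it->second.submission << "\n";
    return 0;
}
--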

Comment 6 Anne-Louise Tangring 2016-05-26 20:14:18 UTC
MRG-Grid is in maintenance and only customer escalations will be considered. This issue can be reopened if a customer escalation associated with it occurs.

