Bug 733495

Summary: Ensure that duplicate entries of cluster.proc in history can be detected across submissions
Product: Red Hat Enterprise MRG
Reporter: Pete MacKinnon <pmackinn>
Component: condor-qmf
Assignee: grid-maint-list <grid-maint-list>
Status: CLOSED WONTFIX
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: low
Priority: low
Version: Development
CC: iboverma, ltrilety, matt, mkudlej, tstclair
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2016-05-26 20:14:18 UTC

Description Pete MacKinnon 2011-08-25 19:40:01 UTC
Jobs are maintained in a global lookup keyed by their cluster.proc id. These ids can periodically roll over, leading to duplicates that can be masked under the wrong submission group...

"The missing submissions appear to be ones who have a cluster+proc 
already present in the history file. Lookup of job information is on 
cluster+proc, so the first occurrence swallows subsequent occurrences.

Consider this history file...

--
Cmd = "/bin/sleep"
ClusterId = 1
QDate = 1314044143
Args = "1d"
Owner = "matt"
EnteredCurrentStatus = 1314044328
Submission = "ONE"
JobStatus = 3
GlobalJobId = "matt@eeyore#1.0#1314044148"
ProcId = 0
LastJobStatus = 5
***
Cmd = "/bin/sleep"
ClusterId = 1
QDate = 1314044143
Args = "1d"
Owner = "matt"
EnteredCurrentStatus = 1314044328
Submission = "TWO"
JobStatus = 3
GlobalJobId = "matt@eeyore#1.0#1314044148"
ProcId = 0
LastJobStatus = 5
***
--

Only submission ONE will appear. The job_server reports...

--
08/23/11 07:20:55 ProcessHistoryTimer() called
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 Job::Job of '1.0'
08/23/11 07:20:55 Created new SubmissionObject 'ONE' for 'matt'
08/23/11 07:20:55 SubmissionObject::Increment 'REMOVED' on '1.0'
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 HistoryJobImpl destroyed: key '1.0'
08/23/11 07:20:55 HistoryJobImpl added to '1.0'
--

Seem reasonable? I'd like to consider it NOTABUG - GIGO. Except that 
SCHEDD_CLUSTER_MAXIMUM_VALUE rollover could mean duplicate cluster+proc 
entries in the history file. BZ for future fix? Likely also important for ODS.

Comment 1 Pete MacKinnon 2011-08-25 19:40:25 UTC
Issue for aviary query server also.

Comment 2 Pete MacKinnon 2011-08-25 19:44:16 UTC
Possibly related to Bug 732452.

Comment 3 Pete MacKinnon 2011-10-24 22:15:57 UTC
You get other strange behavior with a shortened SCHEDD_CLUSTER_MAXIMUM_VALUE, like jobs that appear to be stuck running in the queue after a schedd restart:

-- Submitter: pmackinn.redhat.com : <192.168.1.131:48084> : milo.usersys.redhat.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   pmackinn       10/24 17:40   0+00:12:43 R  0   0.0  sleep 120         
   2.0   pmackinn       10/24 17:40   0+00:12:43 R  0   0.0  sleep 120         

2 jobs; 0 idle, 2 running, 0 held

And if you then try to revert from SCHEDD_CLUSTER_MAXIMUM_VALUE=3, the schedd gets jammed unless the queue is cleaned out:

10/24/11 18:13:44 (pid:13483) ERROR "JOB QUEUE DAMAGED; header ad NEXT_CLUSTER_NUM invalid" at line 1088 in file /home/pmackinn/repos/uw/condor/CONDOR_SRC/src/condor_schedd.V6/qmgmt.cpp

Comment 4 Pete MacKinnon 2011-10-25 20:19:28 UTC
Using the global job id as a key in the jobs map solves this but that would break the current Aviary get* API since the user would have to provide the GJID instead of the simpler cluster.proc, i.e., regexp on some part of 'scheduler#c.p' and return all matches...?
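For illustration only, a hedged C++17 sketch of that idea. The map name, the helper, and the second (rolled-over) GJID with a different qdate are all hypothetical, not Aviary API; it assumes the GJID format 'scheduler#cluster.proc#qdate':

--
// Hypothetical sketch: key the jobs map by GlobalJobId so duplicates of
// "1.0" no longer collide, and answer cluster.proc-style queries by
// scanning for "#c.p#" matches (a crude stand-in for the regexp idea).
#include <map>
#include <string>
#include <vector>
#include <iostream>

static std::map<std::string, std::string> jobsByGjid;  // GJID -> submission

// Return every GJID whose embedded id matches "cluster.proc".
std::vector<std::string> findByClusterProc(const std::string& clusterProc) {
    std::vector<std::string> hits;
    const std::string needle = "#" + clusterProc + "#";
    for (const auto& [gjid, submission] : jobsByGjid) {
        if (gjid.find(needle) != std::string::npos) {
            hits.push_back(gjid);
        }
    }
    return hits;
}

int main() {
    jobsByGjid["matt@eeyore#1.0#1314044148"] = "ONE";
    jobsByGjid["matt@eeyore#1.0#1314099999"] = "TWO";  // hypothetical rolled-over duplicate

    for (const auto& gjid : findByClusterProc("1.0")) {
        std::cout << gjid << " -> " << jobsByGjid[gjid] << "\n";  // both entries survive
    }
    return 0;
}
--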

Comment 5 Pete MacKinnon 2011-10-26 14:56:59 UTC
A multimap approach could be used to address this in the implementation, but again there is potential Aviary API impact.
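A minimal sketch of the multimap idea, assuming a std::multimap keyed by cluster.proc (illustrative only, not the actual implementation). The API impact is visible here: a single cluster.proc can now map to more than one job, so get*-style lookups would have to return a collection:

--
// Hypothetical sketch: a multimap keeps every history record, even when
// cluster.proc repeats after a rollover; lookups use equal_range().
#include <map>
#include <string>
#include <iostream>

int main() {
    std::multimap<std::string, std::string> jobs;  // cluster.proc -> submission

    jobs.insert({"1.0", "ONE"});
    jobs.insert({"1.0", "TWO"});  // duplicate key is kept, not swallowed

    auto range = jobs.equal_range("1.0");
    for (auto it = range.first; it != range.second; ++it) {
        std::cout << it->first << " -> " << it->second << "\n";
    }
    return 0;
}
--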

Comment 6 Anne-Louise Tangring 2016-05-26 20:14:18 UTC
MRG-Grid is in maintenance and only customer escalations will be considered. This issue can be reopened if a customer escalation associated with it occurs.