Jobs are maintained in a global lookup keyed by their cluster.proc id. These ids can periodically roll over, leading to duplicates that can be masked under the wrong submission group...

"The missing submissions appear to be ones that have a cluster+proc already present in the history file. Lookup of job information is on cluster+proc, so the first occurrence swallows subsequent occurrences. Consider this history file...

--
Cmd = "/bin/sleep"
ClusterId = 1
QDate = 1314044143
Args = "1d"
Owner = "matt"
EnteredCurrentStatus = 1314044328
Submission = "ONE"
JobStatus = 3
GlobalJobId = "matt@eeyore#1.0#1314044148"
ProcId = 0
LastJobStatus = 5
***
Cmd = "/bin/sleep"
ClusterId = 1
QDate = 1314044143
Args = "1d"
Owner = "matt"
EnteredCurrentStatus = 1314044328
Submission = "TWO"
JobStatus = 3
GlobalJobId = "matt@eeyore#1.0#1314044148"
ProcId = 0
LastJobStatus = 5
***
--

Only submission ONE will appear. The job_server reports...

--
08/23/11 07:20:55 ProcessHistoryTimer() called
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 Job::Job of '1.0'
08/23/11 07:20:55 Created new SubmissionObject 'ONE' for 'matt'
08/23/11 07:20:55 SubmissionObject::Increment 'REMOVED' on '1.0'
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 HistoryJobImpl created for '1.0'
08/23/11 07:20:55 HistoryJobImpl destroyed: key '1.0'
08/23/11 07:20:55 HistoryJobImpl added to '1.0'
--

Seem reasonable? I'd like to consider it NOTABUG - GIGO. Except that SCHEDD_CLUSTER_MAXIMUM_VALUE can legitimately produce duplicate cluster+proc ids in the history file. BZ for future fix? Likely also important for ODS.
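To make the collision concrete, here is a minimal sketch of why a single map keyed on cluster.proc drops the second history record. This is an illustration only, not the actual job_server code; the container and type names are made up:

--
#include <map>
#include <string>
#include <iostream>

// Hypothetical stand-in for the per-job state the job_server keeps.
struct HistoryJob {
    std::string submission;     // e.g. "ONE" or "TWO"
    std::string global_job_id;
};

int main() {
    // Global lookup keyed on "cluster.proc", as described above.
    std::map<std::string, HistoryJob> jobs_by_id;

    // First history record for 1.0 (Submission = "ONE") is inserted.
    jobs_by_id.insert({"1.0", {"ONE", "matt@eeyore#1.0#1314044148"}});

    // Second record for 1.0 (Submission = "TWO") after the cluster ids
    // rolled over: insert() is a no-op because the key already exists,
    // so submission TWO is never recorded.
    jobs_by_id.insert({"1.0", {"TWO", "matt@eeyore#1.0#1314044148"}});

    std::cout << jobs_by_id["1.0"].submission << "\n";  // prints "ONE"
}
--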
Issue for the Aviary query server also.
Possibly related to Bug 732452.
You get other strange behavior with a shortened SCHEDD_CLUSTER_MAXIMUM_VALUE, like jobs that appear to be stuck running in the queue after a schedd restart:

-- Submitter: pmackinn.redhat.com : <192.168.1.131:48084> : milo.usersys.redhat.com
 ID      OWNER         SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   pmackinn     10/24 17:40   0+00:12:43 R  0   0.0  sleep 120
   2.0   pmackinn     10/24 17:40   0+00:12:43 R  0   0.0  sleep 120

2 jobs; 0 idle, 2 running, 0 held

And if you try to go back from SCHEDD_CLUSTER_MAXIMUM_VALUE=3, the schedd aborts on startup unless the queue is cleaned out first:

10/24/11 18:13:44 (pid:13483) ERROR "JOB QUEUE DAMAGED; header ad NEXT_CLUSTER_NUM invalid" at line 1088 in file /home/pmackinn/repos/uw/condor/CONDOR_SRC/src/condor_schedd.V6/qmgmt.cpp
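For context, a small sketch of the wrap-around behavior implied by SCHEDD_CLUSTER_MAXIMUM_VALUE; the function name and starting value are assumptions for illustration, not schedd code. With a maximum of 3 the cluster counter cycles 1, 2, 3, 1, 2, ..., which is how the same cluster.proc (e.g. 1.0) ends up reused:

--
#include <iostream>

// Hypothetical illustration of cluster id wrap-around when a maximum is
// configured: once the counter passes the maximum it starts over at the
// initial value, so the same cluster id (and hence cluster.proc) recurs.
int next_cluster_id(int current, int initial_value, int maximum_value) {
    int next = current + 1;
    return (next > maximum_value) ? initial_value : next;
}

int main() {
    int cluster = 1;
    for (int i = 0; i < 7; ++i) {
        std::cout << cluster << " ";   // prints: 1 2 3 1 2 3 1
        cluster = next_cluster_id(cluster, /*initial_value=*/1, /*maximum_value=*/3);
    }
    std::cout << "\n";
}
--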
Using the global job id as the key in the jobs map solves this, but that would break the current Aviary get* API since the user would have to provide the GJID instead of the simpler cluster.proc. To keep the cluster.proc interface we would have to regexp on some part of 'scheduler#c.p' and return all matches...?
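Roughly what that GJID-keyed alternative would look like (a sketch only; the container, the parsing of the key format, and the function names are assumptions): the map is keyed on the full GlobalJobId, and a cluster.proc query scans for keys whose embedded 'c.p' component matches, returning all of them:

--
#include <map>
#include <string>
#include <vector>

// Hypothetical GJID-keyed store: "matt@eeyore#1.0#1314044148" -> job state.
struct HistoryJob {
    std::string submission;
};

// Extract the "cluster.proc" component from a GJID of the form
// "scheduler#cluster.proc#timestamp". Returns "" if the format is unexpected.
std::string cluster_proc_of(const std::string& gjid) {
    std::string::size_type first = gjid.find('#');
    if (first == std::string::npos) return "";
    std::string::size_type second = gjid.find('#', first + 1);
    if (second == std::string::npos) return "";
    return gjid.substr(first + 1, second - first - 1);
}

// A cluster.proc query now has to return every job whose GJID embeds that id.
std::vector<const HistoryJob*> find_by_cluster_proc(
        const std::map<std::string, HistoryJob>& jobs_by_gjid,
        const std::string& cluster_proc) {
    std::vector<const HistoryJob*> matches;
    for (const auto& entry : jobs_by_gjid) {
        if (cluster_proc_of(entry.first) == cluster_proc)
            matches.push_back(&entry.second);
    }
    return matches;
}
--

Both duplicates from the history example above would then come back for '1.0', which is exactly the API question: does get* start returning lists?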
A multimap approach could be used to address this in the implementation, but again there is potential Aviary API impact, since a cluster.proc lookup could now return multiple jobs.
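A minimal sketch of the multimap variant (again an assumption about the implementation, not existing code): the key stays cluster.proc, duplicates coexist, and lookups use equal_range, so any get* call keyed on cluster.proc has to cope with multiple results:

--
#include <map>
#include <string>
#include <iostream>

int main() {
    // Hypothetical multimap keyed on cluster.proc; both rolled-over jobs fit.
    std::multimap<std::string, std::string> jobs_by_id;  // cluster.proc -> Submission
    jobs_by_id.insert({"1.0", "ONE"});
    jobs_by_id.insert({"1.0", "TWO"});

    // A lookup now yields every job that ever used this cluster.proc.
    auto range = jobs_by_id.equal_range("1.0");
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->first << " -> " << it->second << "\n";
    // prints:
    // 1.0 -> ONE
    // 1.0 -> TWO
}
--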
MRG-Grid is in maintenance and only customer escalations will be considered. This issue can be reopened if a customer escalation associated with it occurs.