Bug 694504 - jobserver GetJobAd doesn't know about existing job
Summary: jobserver GetJobAd doesn't know about existing job
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: Development
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 2.0
Target Release: ---
Assignee: Pete MacKinnon
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-04-07 14:17 UTC by Martin Kudlej
Modified: 2011-05-05 14:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-05 14:44:26 UTC
Target Upstream Version:


Attachments
"condor_config_val -dump" output and log files with ALL_DEBUG = D_ALL (503.98 KB, application/x-gzip)
2011-04-07 14:17 UTC, Martin Kudlej

Description Martin Kudlej 2011-04-07 14:17:08 UTC
Created attachment 490561
"condor_config_val -dump" output and log files with ALL_DEBUG = D_ALL

Description of problem:
I submit a simple job (sleep 10) and call GetJobAd repeatedly using a simple script, test.py.
At first it returns the job's classad; after a while it returns the message:
Unknown Job Id (65536) - {}
and after another while it returns the classad again.
This inconsistent state lasts about 60 seconds with the default configuration of qpidd and qmf.

Version-Release number of selected component (if applicable):
qpid-cpp-client-0.10-3.el5
condor-aviary-7.6.0-0.5.el5
qpid-tools-0.10-2.el5
condor-7.6.0-0.5.el5
python-qpid-qmf-0.10-4.el5
python-condorutils-1.5-2.el5
condor-wallaby-client-4.0-5.el5
condor-qmf-7.6.0-0.5.el5
condor-wallaby-tools-4.0-5.el5
python-qpid-0.10-1.el5
qpid-cpp-server-0.10-3.el5
qpid-qmf-0.10-4.el5
ruby-qpid-qmf-0.10-4.el5


How reproducible:
100%

Steps to Reproduce:
1. Install qpid, qmf, and condor, and configure qmf for condor with JobServer.
2. Run a simple job (for example, sleep 10).
3. Run python test.py _job_id_ (for example, 52.0).
4. Watch the output.
  
Actual results:
JobServer.GetJobAd sometimes returns wrong data.

Expected results:
JobServer.GetJobAd returns the proper classad for any job.

Additional info:
$ cat test.py
import sys
from time import sleep

import qmf.console

# Require the job id (for example 52.0) as the first argument.
if len(sys.argv) < 2:
    sys.exit("Not enough parameters.")

session = qmf.console.Session()
broker = session.addBroker('amqp://cumin/cumin@localhost:5672', 10, 'PLAIN')

# Wait up to 10 seconds for the broker connection.
for i in range(10):
    if broker.isConnected():
        break
    sleep(1)

# Use the first jobserver object published over QMF.
parents = session.getObjects(_class="jobserver")
parent = parents[0]

# Poll GetJobAd every 5 seconds until the job is reported as completed.
while True:
    result = parent.GetJobAd(sys.argv[1])
    print(result)
    # JobStatus 4 means Completed.
    if result.status == 0 and result.outArgs[u'JobAd']['JobStatus'] == 4:
        break
    sleep(5)

session.delBroker(broker)
session.close()

Comment 1 Pete MacKinnon 2011-04-12 20:25:03 UTC
Sounds like the transition period between the live job destruction and the history job creation. But the code should account for that.

Will need detailed job server logging for this. Please rerun with:

JOB_SERVER.JOB_SERVER_DEBUG = D_FULLDEBUG

and report the value of HISTORY_INTERVAL.

Comment 4 Pete MacKinnon 2011-05-05 14:44:26 UTC
By design, the job server doesn't retain the live classad in memory, for size and performance reasons. A user can access the live classad as long as it hasn't been destroyed from the job queue log. Once that happens, the job is archived to the history file. From there it will *eventually* be loaded back into memory with a much smaller footprint than that of the live job.

There is no atomic transaction that moves the job from the job queue to the history file, so from the QMF API's perspective the job appears to "flicker".
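
A client that polls through this window could treat "Unknown Job Id" as temporary and retry. The following is only a minimal sketch of that idea, not part of condor-qmf or the QMF API: get_job_ad_with_retry and make_flickering_fetch are hypothetical helpers, fetch is assumed to be any callable returning a (status, job_ad) pair, and the status values 0 and 65536 are simply the ones that appear in this report.

```python
import time


def get_job_ad_with_retry(fetch, job_id, retries=5, delay=5.0, sleep=time.sleep):
    """Call fetch(job_id) until it reports success or the retries run out.

    fetch is a hypothetical callable returning (status, job_ad). Status 0 is
    treated as success, mirroring the result.status check in test.py.
    """
    status, job_ad = fetch(job_id)
    for _ in range(retries):
        if status == 0:
            break
        sleep(delay)  # wait out the gap between job queue and history
        status, job_ad = fetch(job_id)
    return status, job_ad


def make_flickering_fetch(failures):
    """Hypothetical stand-in for GetJobAd: fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def fetch(job_id):
        state["calls"] += 1
        if state["calls"] <= failures:
            return 65536, {}  # status seen with "Unknown Job Id" in this report
        return 0, {"JobStatus": 4}  # 4 means Completed

    return fetch
```

With the real API, fetch could be a small wrapper such as lambda job_id: (r.status, r.outArgs) built around parent.GetJobAd(job_id); a sensible total wait (retries times delay) would presumably need to cover at least one HISTORY_INTERVAL.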

The test couples a very short job lifetime (10 seconds) with the default history scanning interval of 120 seconds. In this particular test, using a HISTORY_INTERVAL of 13 with a sleep job of 30 seconds doesn't exhibit the described problem (i.e., the two values give a non-zero modulo).

Data collection of jobs (live and historical) is likely to change in the future.

