Bug 694504

Summary: jobserver GetJobAd doesn't know about existing job
Product: Red Hat Enterprise MRG
Reporter: Martin Kudlej <mkudlej>
Component: condor-qmf
Assignee: Pete MacKinnon <pmackinn>
Status: CLOSED WONTFIX
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium
Priority: medium
Version: Development
CC: iboverma, jneedle, matt
Target Milestone: 2.0
Target Release: ---
Hardware: Unspecified
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-05-05 14:44:26 UTC
Attachments:
"condor_config_val -dump" output and log files with ALL_DEBUG = D_ALL (flags: none)

Description Martin Kudlej 2011-04-07 14:17:08 UTC
Created attachment 490561
"condor_config_val -dump" output and log files with ALL_DEBUG = D_ALL

Description of problem:
I submit a simple job (sleep 10) and call GetJobAd repeatedly from a simple script, test.py.
At first it returns the job's classads; after a while it returns the message:
Unknown Job Id (65536) - {}
and after another while it returns the classads again.
This inconsistent state lasts about 60s with the default configuration of qpidd and qmf.

Version-Release number of selected component (if applicable):
qpid-cpp-client-0.10-3.el5
condor-aviary-7.6.0-0.5.el5
qpid-tools-0.10-2.el5
condor-7.6.0-0.5.el5
python-qpid-qmf-0.10-4.el5
python-condorutils-1.5-2.el5
condor-wallaby-client-4.0-5.el5
condor-qmf-7.6.0-0.5.el5
condor-wallaby-tools-4.0-5.el5
python-qpid-0.10-1.el5
qpid-cpp-server-0.10-3.el5
qpid-qmf-0.10-4.el5
ruby-qpid-qmf-0.10-4.el5


How reproducible:
100%

Steps to Reproduce:
1. Install qpid, qmf, and condor, and configure qmf for condor with JobServer.
2. Run a simple job (for example, sleep 10).
3. Run python test.py _job_id_ (for example, 52.0).
4. Watch the output.
  
Actual results:
JobServer.GetJobAd sometimes returns wrong data.

Expected results:
JobServer.GetJobAd returns the proper classads for any job.

Additional info:
$ cat test.py:
import sys
from time import sleep
import qmf.console

if len(sys.argv) < 2:
  # String exceptions are not valid; exit with a message instead.
  sys.exit("Not enough parameters.")

# Connect to the broker and wait up to about 10 seconds for the connection.
session = qmf.console.Session()
broker = session.addBroker('amqp://cumin/cumin@localhost:5672', 10, 'PLAIN')
for i in range(10):
  if broker.isConnected():
    break
  sleep(1)

# Use the first jobserver object published over QMF.
parents = session.getObjects(_class="jobserver")
parent = parents[0]

# Poll GetJobAd every 5 seconds until the job reports JobStatus 4 (completed).
while True:
  result = parent.GetJobAd(sys.argv[1])
  print(result)
  if result.status == 0 and result.outArgs[u'JobAd']['JobStatus'] == 4:
    break
  sleep(5)

session.delBroker(broker)
session.close()

Comment 1 Pete MacKinnon 2011-04-12 20:25:03 UTC
Sounds like the transition period between the live job destruction and the history job creation. But the code should account for that.

Will need detailed job server logging for this. Please rerun with:

JOB_SERVER.JOB_SERVER_DEBUG = D_FULLDEBUG

and report the value of HISTORY_INTERVAL.

Comment 4 Pete MacKinnon 2011-05-05 14:44:26 UTC
By design, the job server doesn't retain the live classad in memory, for size and performance reasons. A user can access the live classad as long as it hasn't been destroyed from the job queue log. Once that happens, the job will be archived to the history file. From there it will *eventually* be loaded back into memory with a much smaller footprint than that of the live job.

There is no atomic transaction that moves the job out of the job queue to the history file. So, the job from the QMF API perspective appears to "flicker".
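
A client that has to live with this "flicker" could treat an unknown-job answer as transient for a bounded grace period instead of failing at once. The following is only a minimal sketch of that idea, not part of the job server or the QMF API: the helper name get_job_ad_with_grace, its parameters, and the (status, job_ad) return shape of fetch are all assumptions made for illustration. The grace period would presumably need to be at least as long as the history scanning interval.

```python
import time


def get_job_ad_with_grace(fetch, job_id, grace=120.0, delay=5.0,
                          clock=time.time, sleep=time.sleep):
    """Retry fetch(job_id) until it succeeds or the grace period expires.

    fetch is a hypothetical callable returning (status, job_ad), where
    status 0 means success and any other value (for example 65536,
    "Unknown Job Id") is treated as a possibly transient miss.
    """
    deadline = clock() + grace
    while True:
        status, job_ad = fetch(job_id)
        if status == 0:
            return job_ad
        if clock() >= deadline:
            # The job stayed unknown for the whole grace period.
            raise LookupError("job %s still unknown after %.0fs"
                              % (job_id, grace))
        sleep(delay)
```

With the objects from test.py, fetch could wrap parent.GetJobAd and return (result.status, result.outArgs.get('JobAd')). The clock and sleep parameters exist only so the loop can be exercised without real waiting.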

The test has a very short job lifetime (10 sec) coupled with the default history scanning interval of 120 seconds. In this particular test, using a HISTORY_INTERVAL of 13 with a sleep job of 30 seconds doesn't exhibit the described problem (i.e., non-zero modulo).

Data collection of jobs (live and historical) is likely to change in the future.