
Bug 694504

Summary: jobserver GetJobAd doesn't know about existing job
Product: Red Hat Enterprise MRG
Component: condor-qmf
Version: Development
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: Martin Kudlej <mkudlej>
Assignee: Pete MacKinnon <pmackinn>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: iboverma, jneedle, matt
Target Milestone: 2.0
Hardware: Unspecified
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-05-05 14:44:26 UTC

Attachments:
"condor_config_val -dump" output and log files with ALL_DEBUG = D_ALL (flags: none)

Description Martin Kudlej 2011-04-07 14:17:08 UTC

Created attachment 490561:
"condor_config_val -dump" output and log files with ALL_DEBUG = D_ALL

Description of problem:
I submit a simple job (sleep 10) and call GetJobAd repeatedly from a simple script, test.py.
At first it returns the job's ClassAd. After a while it returns the message:
Unknown Job Id (65536) - {}
and after another while it returns the ClassAd again.
This inconsistent state lasts about 60 s with the default configuration of qpidd and qmf.

Version-Release number of selected component (if applicable):
qpid-cpp-client-0.10-3.el5
condor-aviary-7.6.0-0.5.el5
qpid-tools-0.10-2.el5
condor-7.6.0-0.5.el5
python-qpid-qmf-0.10-4.el5
python-condorutils-1.5-2.el5
condor-wallaby-client-4.0-5.el5
condor-qmf-7.6.0-0.5.el5
condor-wallaby-tools-4.0-5.el5
python-qpid-0.10-1.el5
qpid-cpp-server-0.10-3.el5
qpid-qmf-0.10-4.el5
ruby-qpid-qmf-0.10-4.el5


How reproducible:
100%

Steps to Reproduce:
1. Install qpid, qmf, and condor, and configure qmf for condor with JobServer.
2. Run a simple job (for example, sleep 10).
3. Run python test.py _job_id_ (for example, 52.0).
4. Watch the output.
  
Actual results:
JobServer.GetJobAd sometimes returns "Unknown Job Id" for a job that exists.

Expected results:
JobServer.GetJobAd returns the proper ClassAd for any existing job.

Additional info:
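$ cat test.py
# Poll JobServer.GetJobAd over QMF until the job reports JobStatus 4 (completed).
import sys
from time import sleep
import qmf.console

if len(sys.argv) < 2:
  sys.exit("Not enough parameters. Usage: test.py _job_id_")

session = qmf.console.Session()
broker = session.addBroker('amqp://cumin/cumin@localhost:5672', 10, 'PLAIN')
# Wait up to 10 seconds for the broker connection to come up.
for i in range(10):
  if broker.isConnected():
    break
  sleep(1)

# Use the first jobserver object published on the broker.
parents = session.getObjects(_class="jobserver")
parent = parents[0]

while True:
  result = parent.GetJobAd(sys.argv[1])
  print(result)
  # status 0 means the call succeeded; JobStatus 4 means the job completed.
  if result.status == 0 and result.outArgs[u'JobAd']['JobStatus'] == 4:
    break

  sleep(5)

session.delBroker(broker)
session.close()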
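
To put a number on how long the inconsistent state lasts, the polling results can be reduced to a list of gaps. The following is a minimal sketch and not part of the original test: unknown_windows and the (timestamp, status) sample format are hypothetical names chosen for illustration, and it assumes that status 0 means GetJobAd returned a ClassAd and any other status means it did not. The sample data is made up.

```python
def unknown_windows(samples):
    """Return (start, end) pairs covering runs of failed GetJobAd polls.

    samples: iterable of (timestamp_seconds, status) in time order.
    A window opens at the first non-zero status and closes at the next
    zero status. A run still open when the samples end is reported
    with end=None.
    """
    windows = []
    start = None
    for ts, status in samples:
        if status != 0 and start is None:
            start = ts
        elif status == 0 and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Illustrative data only: the job is unknown from t=15 until t=70.
samples = [(0, 0), (5, 0), (10, 0), (15, 1), (20, 1), (65, 1), (70, 0), (75, 0)]
for start, end in unknown_windows(samples):
    print("unknown from %s to %s" % (start, end))
```

Recording time() alongside result.status in the polling loop would produce samples in this shape.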

Comment 1 Pete MacKinnon 2011-04-12 20:25:03 UTC

Sounds like the transition period between the live job destruction and the history job creation. But the code should account for that.

Will need detailed job server logging for this. Please rerun with:

JOB_SERVER.JOB_SERVER_DEBUG = D_FULLDEBUG

and report the value of HISTORY_INTERVAL.

Comment 4 Pete MacKinnon 2011-05-05 14:44:26 UTC

By design, the job server doesn't retain the live classad in memory, for size and performance reasons. A user can access the live classad as long as it hasn't been destroyed from the job queue log. Once that happens, the job will be archived to the history file. From there it will *eventually* be loaded back into memory with a much smaller footprint than that of the live job.

There is no atomic transaction that moves the job out of the job queue to the history file. So, from the QMF API perspective, the job appears to "flicker".

The test has a very short job lifetime (10 sec) coupled with the default history scanning interval of 120 seconds. In this particular test, using a HISTORY_INTERVAL of 13 with a sleep job of 30 doesn't exhibit the described problem (i.e., the job lifetime modulo the interval is non-zero: 30 mod 13 = 4).

Data collection of jobs (live and historical) is likely to change in the future.
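
A client that has to tolerate the "flicker" could treat an unknown job as a transient condition and retry for a bounded time. The following is a minimal sketch under stated assumptions, not a supported API: get_job_ad_with_retry is a hypothetical helper, the 180-second default timeout is an arbitrary value chosen to exceed the 120-second default history scanning interval, and any non-zero status is assumed to be retryable.

```python
import time


def get_job_ad_with_retry(get_job_ad, job_id, timeout=180.0, poll=5.0,
                          sleep=time.sleep, clock=time.time):
    """Call get_job_ad(job_id) until it succeeds or timeout seconds pass.

    get_job_ad is any callable returning an object with .status and
    .outArgs, such as the bound GetJobAd method of the jobserver object.
    Returns the JobAd dict, or None if the job stayed unknown.
    The sleep and clock parameters exist so the loop can be tested
    without real delays.
    """
    deadline = clock() + timeout
    while True:
        result = get_job_ad(job_id)
        if result.status == 0:
            return result.outArgs['JobAd']
        if clock() >= deadline:
            return None
        sleep(poll)


# Hypothetical usage with the objects from test.py:
#   ad = get_job_ad_with_retry(parent.GetJobAd, sys.argv[1])
```

This only hides the gap from the caller. It does not shorten the window itself, and a job id that never existed is reported only after the full timeout.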