Created attachment 490561
"condor_config_val -dump" output and log files with ALL_DEBUG = D_ALL

Description of problem:
I submit a simple job (sleep 10) and call GetJobAd repeatedly with a simple script, test.py. At first it returns the job's classad. After a while it returns the message "Unknown Job Id (65536) - {}", and after another while it returns the classad again. This inconsistent state lasts about 60s with the default configuration of qpidd and qmf.

Version-Release number of selected component (if applicable):
qpid-cpp-client-0.10-3.el5
condor-aviary-7.6.0-0.5.el5
qpid-tools-0.10-2.el5
condor-7.6.0-0.5.el5
python-qpid-qmf-0.10-4.el5
python-condorutils-1.5-2.el5
condor-wallaby-client-4.0-5.el5
condor-qmf-7.6.0-0.5.el5
condor-wallaby-tools-4.0-5.el5
python-qpid-0.10-1.el5
qpid-cpp-server-0.10-3.el5
qpid-qmf-0.10-4.el5
ruby-qpid-qmf-0.10-4.el5

How reproducible:
100%

Steps to Reproduce:
1. Install qpid, qmf and condor, and configure qmf for condor with JobServer.
2. Run a simple job (for example, sleep 10).
3. Run "python test.py _job_id_" (for example, 52.0).
4. Watch the output.

Actual results:
JobServer.GetJobAd sometimes returns wrong data.

Expected results:
JobServer.GetJobAd returns the proper classad of any job.

Additional info:
$ cat test.py

    import sys
    from time import sleep
    import qmf.console

    if len(sys.argv) < 2:
        raise SystemExit("Not enough parameters.")

    session = qmf.console.Session()
    broker = session.addBroker('amqp://cumin/cumin@localhost:5672', 10, 'PLAIN')

    # Wait up to 10 seconds for the broker connection.
    for i in range(10):
        if broker.isConnected():
            break
        else:
            sleep(1)

    parents = session.getObjects(_class="jobserver")
    parent = parents[0]

    # Poll GetJobAd every 5 seconds until the job reports JobStatus 4.
    while True:
        result = parent.GetJobAd(sys.argv[1])
        print result
        if result.status == 0 and result.outArgs[u'JobAd']['JobStatus'] == 4:
            break
        sleep(5)

    session.delBroker(broker)
    session.close()
This sounds like the transition period between the destruction of the live job and the creation of the history job, but the code should account for that. Detailed job server logging is needed to investigate. Please rerun with:

JOB_SERVER.JOB_SERVER_DEBUG = D_FULLDEBUG

and include the value of HISTORY_INTERVAL.
By design, the job server doesn't retain the live classad in memory, for size and performance considerations. A user can access the live classad as long as it hasn't been destroyed from the job queue log. Once that happens, the job is archived to the history file. From there it will *eventually* be loaded back into memory with a much smaller footprint than that of the live job.

There is no atomic transaction that moves the job out of the job queue and into the history file, so from the QMF API perspective the job appears to "flicker".

The test combines a very short job lifetime (10 sec) with the default history scanning interval of 120 seconds. In this particular test, using a HISTORY_INTERVAL of 13 with a sleep job of 30 doesn't exhibit the described problem (i.e., 30 mod 13 is non-zero).

Data collection of jobs (live and historical) is likely to change in the future.
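Until that changes, a client can ride out the window between live job destruction and history loading by retrying instead of treating "Unknown Job Id" as final. The following is only a minimal sketch of that idea, not anything shipped with condor-qmf: get_job_ad_with_retry, fetch, and the 150 second default timeout (chosen to exceed the 120 second history interval) are hypothetical, and it assumes any non-zero status is worth retrying.

```python
import time


def get_job_ad_with_retry(fetch, job_id, timeout=150.0, interval=5.0,
                          sleep=time.sleep, clock=time.time):
    """Call fetch(job_id) until it reports status 0 or the timeout elapses.

    fetch is a hypothetical callable returning (status, job_ad).
    Assumption: a non-zero status (such as the 65536 "Unknown Job Id"
    seen in this report) may be transient while the job moves from the
    job queue to the history file, so it is retried.
    """
    deadline = clock() + timeout
    while True:
        status, job_ad = fetch(job_id)
        if status == 0:
            return job_ad
        if clock() >= deadline:
            raise RuntimeError("job %s still unknown after %s seconds"
                               % (job_id, timeout))
        sleep(interval)
```

With the test.py above, fetch could wrap parent.GetJobAd and return (result.status, result.outArgs[u'JobAd']) on success, or (result.status, None) otherwise. A job id that never existed would also be retried until the timeout, which is the cost of not being able to tell the two cases apart from the status alone.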