The Job Server cleans its job and submission data structures, deleting the objects within them, whenever a Reset occurs on the JobServerJobLogConsumer. A Reset can be produced under multiple scenarios, such as the daily timer firing or an incomplete record being probed in the schedd's job queue log. The cleanup on Reset is necessary so that the jobs and submissions can be re-populated from the live and historical sources.

The test sets a frequent Reset interval on the Job Server (say 30 sec) while sleep jobs with similar random durations around 30 sec are submitted simultaneously and QMF calls for GetJobSummaries run in parallel. A SEGV can randomly occur at:

(gdb) where
#0 0x080b002f in SubmissionObject::GetJobSummaries (this=0x82cfe38, jobs=std::list = {...}, text="") at SubmissionObject.cpp:268
#1 0x080a98b2 in qmf::com::redhat::grid::Submission::doMethod (this=0x82cfec0, methodName="GetJobSummaries", inMap=std::map with 0 elements, outMap=std::map with 0 elements) at Submission.cpp:318
#2 0x00b3f781 in qpid::management::ManagementAgentImpl::invokeMethodRequest (this=0xb7fb3008, body="\000\000\000\311\000\000\000\003\n_object_id\250\000\000\000\202\000\000\000\002\f_object_name\225\000\"pmackinn#147\v_agent_name\225\000=com.redhat.grid:jobserver:jobs@pmackinn\f_method_name\225\000\017GetJobSummaries\n_arguments\250\000\000\000\004\000\000\000", cid="83", replyTo="qmfc-v2-pmackinn-fedora.13972.1") at qpid/agent/ManagementAgentImpl.cpp:568
#3 0x00b4627e in qpid::management::ManagementAgentImpl::pollCallbacks (this=0xb7fb3008, callLimit=0) at qpid/agent/ManagementAgentImpl.cpp:359
#4 0x080aa8eb in HandleMgmtSocket () at job_server_main.cpp:314
#5 0x080ca9b3 in DaemonCore::CallSocketHandler_worker (this=0x81dddf0, i=2, default_to_HandleCommand=true, asock=0x0) at daemon_core.cpp:3501
#6 0x080caf0c in DaemonCore::CallSocketHandler_worker_demarshall (arg=0x82b8740) at daemon_core.cpp:3468
#7 0x0810fb97 in CondorThreads::pool_add (routine=0x80caee0 <DaemonCore::CallSocketHandler_worker_demarshall(void*)>, arg=0x82b8740, tid=0x81e1e2c, descrip=0x82025a8 "Handler for Mgmt Methods.") at condor_threads.cpp:1106
#8 0x080c37e7 in DaemonCore::CallSocketHandler (this=0x81dddf0, i=@0xbfffe83c, default_to_HandleCommand=true) at daemon_core.cpp:3456
#9 0x080c5ea1 in DaemonCore::Driver (this=0x81dddf0) at daemon_core.cpp:3347
#10 0x080d24e5 in main (argc=1, argv=0xbfffee5c) at daemon_core_main.cpp:2295

The problem is believed to stem from job/submission decrements that were removed to keep job totals correct. However, those decrements were also what erased entries from the submission job-state std::sets, so after a Reset the sets can retain pointers to deleted Job objects. (This implementation needs reworking.)
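For concreteness, a minimal C++ sketch of the suspected use-after-free, assuming simplified class shapes. The member names (m_idle, job_table) and the class bodies here are illustrative assumptions, not the actual job server source:

#include <list>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified shapes of the classes in the trace above.
struct Job {
    std::string key;
    std::string Summary() const { return key; }   // dereferences 'this'
};

struct SubmissionObject {
    std::set<const Job *> m_idle;                 // job-state set of raw pointers

    // The decrement path erased the pointer; it was removed to fix job totals.
    void Decrement(const Job *job) { m_idle.erase(job); }

    void GetJobSummaries(std::list<std::string> &out) const {
        for (const Job *j : m_idle)
            out.push_back(j->Summary());          // SEGV if *j was freed on Reset
    }
};

int main() {
    SubmissionObject sub;
    std::map<std::string, Job *> job_table;

    Job *j = new Job{"829.1"};
    job_table[j->key] = j;
    sub.m_idle.insert(j);

    // Reset: wipe the job table. Without the Decrement call,
    // sub.m_idle is left holding a dangling pointer to the deleted Job.
    for (auto &entry : job_table) delete entry.second;
    job_table.clear();

    std::list<std::string> summaries;
    sub.GetJobSummaries(summaries);   // use-after-free: may SEGV here
    return 0;
}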
FH sha d96a25fba in V7_4-QMF-branch.

Unit testing of 10 submissions with random async Resets shows no SEGVs from parallel QMF job server queries. There is still some "catch up" after a Reset, since we wipe out all job, cluster, and submission objects: jobs already in the JQL do not trigger a NewClassAd call from the JQL reader, so we have to wait for the history jobs to coalesce. The infrastructure is in place to make this more seamless, but it needs further discussion.

Resets can be triggered using "pkill -10 condor_job_serv". The QMF queries can be run with:

jobinventory.py jobserver:jobs@<your JS name>
submissions.py jobserver:jobs@<your JS name>
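For illustration, a minimal sketch of the detach-before-delete ordering that such a rework implies. The class names echo JobServerJobLogConsumer and SubmissionObject from the comments above, but the members and the Detach helper are hypothetical, not the contents of the actual commit:

#include <map>
#include <set>
#include <string>

struct Job;

struct SubmissionObject {
    std::set<const Job *> m_idle, m_running, m_completed;

    // Erase the job from every state set while it is still alive.
    void Detach(const Job *j) {
        m_idle.erase(j);
        m_running.erase(j);
        m_completed.erase(j);
    }
};

struct Job {
    SubmissionObject *m_submission = nullptr;
};

struct JobServerJobLogConsumer {
    std::map<std::string, Job *> m_jobs;
    std::map<std::string, SubmissionObject *> m_submissions;

    void Reset() {
        // Pass 1: unlink every job from its submission while both
        // objects are still alive, so no set keeps a stale pointer.
        for (auto &kv : m_jobs)
            if (kv.second->m_submission)
                kv.second->m_submission->Detach(kv.second);

        // Pass 2: delete jobs, then submissions.
        for (auto &kv : m_jobs) delete kv.second;
        m_jobs.clear();
        for (auto &kv : m_submissions) delete kv.second;
        m_submissions.clear();

        // Jobs and submissions are then re-populated from the live
        // and historical sources (the "catch up" described above).
    }
};

The point is the ordering: every raw pointer is erased from the state sets before any object is deleted, so nothing reachable after the Reset refers to freed memory.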
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause - Condor issues a reset of the job queue log file on a daily timer or if there was a problem probing the file.
Consequence - A QMF query for job submission information could lead to a SEGV of the job server after the reset.
Fix - Significant refactoring of job server internals to ensure no interface usage of deleted C++ objects.
Result - The job server no longer SEGVs after a reset followed by QMF queries.
Created attachment 473024 [details]
Python test script to scan submissions from QMF

A Reset can be triggered by sending SIGUSR1 (10) to the job server process.
Created attachment 473031 [details]
Python script to match Condor jobs to jobs listed in QMF
Created attachment 473033 [details]
Script to print out details of every grid QMF object on a broker
Tomas, can you try to repro with condor-qmf-7.4.4-0.16 instead of -17? There were interim code changes in -17 that may be masking (but not correcting) the original bug.
QUEUE_CLEAN_INTERVAL was set to 5, which determines the Reset frequency.
Reproduced on:

$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

Config:
JOB_SERVER_LOG=/var/log/condor/jobserver.log
QMF_PUBLISH_SUBMISSIONS=FALSE
USE_PROCD = False
COLLECTOR_QUERY_WORKERS = 0
SCHEDD_QUERY_WORKERS = 0
SCHEDD_INTERVAL = 5
HISTORY_INTERVAL = 30
QUEUE_CLEAN_INTERVAL = 5
ALL_DEBUG=D_FULLDEBUG

# cat rhelprog.sub
Universe = vanilla
Executable = /bin/sleep
Arguments = $RANDOM_INTEGER(5,30,1)
should_transfer_files = false
Queue $RANDOM_INTEGER(1,5,1)

Run in loop:
python submissions.py jobserver:jobs@localhost

JobServer.log:
1/19/11 05:06:06 LiveJob destroyed: key '829.1'
01/19/11 05:06:06 LiveJob destroyed: key '830.1'
01/19/11 05:06:06 LiveJob destroyed: key '830.2'
Stack dump for process 15451 at timestamp 1295431566 (15 frames)
condor_job_server(dprintf_dump_stack+0x44)[0x80df9c4]
condor_job_server[0x80e1804]
[0x455420]
condor_job_server(_ZNSt8_Rb_treeIPK3JobS2_St9_IdentityIS2_E6cmpjobSaIS2_EE5eraseERKS2_+0x28)[0x80b6eb8]
condor_job_server(_ZN16SubmissionObject9DecrementEPK3Job+0x128)[0x80b6568]
condor_job_server(_ZN3Job19DecrementSubmissionEv+0x13)[0x80b15e3]
condor_job_server(_ZN7LiveJobD0Ev+0x2f)[0x80b161f]
condor_job_server(_ZN23JobServerJobLogConsumer5ResetEv+0x54)[0x80b71a4]
condor_job_server(_ZN12JobLogReader8BulkLoadEv+0x21)[0x8110701]
condor_job_server(_ZN12JobLogReader4PollEv+0xa9)[0x81107c9]
condor_job_server(_ZN12TimerManager7TimeoutEv+0x14b)[0x80def2b]
condor_job_server(_ZN10DaemonCore6DriverEv+0x244)[0x80c7784]
condor_job_server(main+0xd80)[0x80db320]
/lib/libc.so.6(__libc_start_main+0xdc)[0x553e9c]
condor_job_server[0x80a2ed1]
Retested on:

$CondorVersion: 7.4.5 Jan 4 2011 BuildID: RH-7.4.5-0.6.el5 PRE-RELEASE $ $CondorPlatform: I386-LINUX_RHEL5 $
$CondorVersion: 7.4.5 Jan 4 2011 BuildID: RH-7.4.5-0.6.el5 PRE-RELEASE $ $CondorPlatform: X86_64-LINUX_RHEL5 $
$CondorVersion: 7.4.5 Jan 4 2011 BuildID: RH-7.4.5-0.6.el4 PRE-RELEASE $ $CondorPlatform: I386-LINUX_RHEL4 $
$CondorVersion: 7.4.5 Jan 4 2011 BuildID: RH-7.4.5-0.6.el4 PRE-RELEASE $ $CondorPlatform: X86_64-LINUX_RHEL4 $

Config:
JOB_SERVER_LOG=/var/log/condor/jobserver.log
QMF_PUBLISH_SUBMISSIONS=FALSE
USE_PROCD = False
COLLECTOR_QUERY_WORKERS = 0
SCHEDD_QUERY_WORKERS = 0
SCHEDD_INTERVAL = 5
HISTORY_INTERVAL = 30
QUEUE_CLEAN_INTERVAL = 5
ALL_DEBUG=D_FULLDEBUG

JobServer.log:
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.0'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.0'
01/19/11 09:19:04 Job::~Job of '19.1'
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.1'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.1'
01/19/11 09:19:04 Job::~Job of '19.2'
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.2'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.2'
01/19/11 09:19:04 Job::~Job of '19.3'
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.3'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.3'
01/19/11 09:19:04 Job::~Job of '19.4'
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.4'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.4'
MaxLog = 1048576, length = 0
Saving log file to "/var/log/condor/jobserver.log.old"

No regression found. Tested with 20 and 40 random jobs while submissions.py ran concurrently in a loop.

>>> VERIFIED
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,7 +1 @@
-Cause - Condor issues a reset of the job queue log file on a daily timer or if there was a problem probing the file.
+Prior to this update, Condor issued a reset of the job queue log file on a daily basis or if there was a problem probing the file. As a result, after the reset, a QMF (Qpid Management Framework) query for job submission information could lead to a segmentation fault at the job server. With this update, significant refactoring of the job server internals ensures that no interface uses any deleted C++ objects; thus, the job server no longer crashes with a segmentation fault after a reset followed by QMF queries.
-
-Consequence - A QMF query for job submission information could lead to a SEGV of the job server after the reset.
-
-Fix - Significant refactoring of job server internals to ensure no interface usage of deleted C++ objects.
-
-Result - Job server no longer SEGV after a reset followed by QMF queries.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html