Bug 634383 - jobserver: frequent Resets of JobServerJobLogConsumer lead to SEGV
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: 1.3
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: 1.3.2
Assignee: Pete MacKinnon
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-09-16 00:21 UTC by Pete MacKinnon
Modified: 2011-02-15 12:14 UTC (History)
2 users

Fixed In Version: condor-7.4.5-0.5
Doc Type: Bug Fix
Doc Text:
Prior to this update, Condor issued a reset of the job queue log file on a daily basis, or whenever there was a problem probing the file. As a result, a QMF (Qpid Management Framework) query for job submission information issued after the reset could lead to a segmentation fault in the job server. With this update, significant refactoring of the job server internals ensures that no interface uses deleted C++ objects; the job server therefore no longer crashes with a segmentation fault when a reset is followed by QMF queries.
Clone Of:
Environment:
Last Closed: 2011-02-15 12:14:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Python test script to scan submissions from QMF (946 bytes, text/x-python)
2011-01-12 13:58 UTC, Pete MacKinnon
Py script to match condor jobs to jobs listed in QMF (1.69 KB, text/x-python)
2011-01-12 14:12 UTC, Pete MacKinnon
Script to print out details of every grid QMF object on a broker (1.06 KB, text/x-python)
2011-01-12 14:14 UTC, Pete MacKinnon


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0217 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update 2011-02-15 12:10:15 UTC

Description Pete MacKinnon 2010-09-16 00:21:28 UTC
The Job Server cleans its job and submission data structures and deletes the objects within them whenever a Reset occurs on the JobServerJobLogConsumer. A Reset can be triggered in several ways, such as by a daily timer or by an incomplete record being probed in the schedd's job queue log. The Reset cleaning is necessary to re-populate the jobs and submissions from the live and historical sources.

The test sets a frequent Reset interval on the Job Server (say, 30 seconds) while sleep jobs with similar random durations (around 30 seconds) are submitted simultaneously and QMF GetJobSummaries calls are issued in parallel.
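The stress pattern above can be sketched as a toy Python model: one thread repeatedly "resets" (wipes and repopulates) a shared submission table while another thread queries it, which is exactly the race the test provokes. All names here are illustrative stand-ins, not the actual job server API; a lock stands in for whatever serialization a correct implementation needs.

```python
# Toy model of the stress test: concurrent Resets vs. queries on shared state.
# Illustrative only -- not the Condor job server code.
import threading

submissions = {}        # submission name -> list of job ids
lock = threading.Lock() # serializes Resets against queries

def reset(n_resets):
    for _ in range(n_resets):
        with lock:
            submissions.clear()                    # wipe, as Reset does
            submissions["sleep"] = list(range(5))  # repopulate from "history"

def query(n_queries, results):
    for _ in range(n_queries):
        with lock:
            # safe snapshot: no reference outlives the lock
            results.append(sum(len(v) for v in submissions.values()))

results = []
t1 = threading.Thread(target=reset, args=(100,))
t2 = threading.Thread(target=query, args=(100, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

Under the lock, a query only ever observes an empty table or a fully repopulated one; without it, the query could walk state mid-wipe, which is the Python analogue of the race being tested.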

SEGV can randomly occur at:

(gdb) where
#0  0x080b002f in SubmissionObject::GetJobSummaries (this=0x82cfe38, jobs=std::list = {...}, text="") at SubmissionObject.cpp:268
#1  0x080a98b2 in qmf::com::redhat::grid::Submission::doMethod (this=0x82cfec0, methodName="GetJobSummaries", inMap=
    std::map with 0 elements, outMap=std::map with 0 elements) at Submission.cpp:318
#2  0x00b3f781 in qpid::management::ManagementAgentImpl::invokeMethodRequest (this=0xb7fb3008, body=
    "\000\000\000\311\000\000\000\003\n_object_id\250\000\000\000\202\000\000\000\002\f_object_name\225\000\"pmackinn#147\v_agent_name\225\000=com.redhat.grid:jobserver:jobs@pmackinn\f_method_name\225\000\017GetJobSummaries\n_arguments\250\000\000\000\004\000\000\000", cid="83", replyTo="qmfc-v2-pmackinn-fedora.13972.1")
    at qpid/agent/ManagementAgentImpl.cpp:568
#3  0x00b4627e in qpid::management::ManagementAgentImpl::pollCallbacks (this=0xb7fb3008, callLimit=0)
    at qpid/agent/ManagementAgentImpl.cpp:359
#4  0x080aa8eb in HandleMgmtSocket () at job_server_main.cpp:314
#5  0x080ca9b3 in DaemonCore::CallSocketHandler_worker (this=0x81dddf0, i=2, default_to_HandleCommand=true, asock=0x0)
    at daemon_core.cpp:3501
#6  0x080caf0c in DaemonCore::CallSocketHandler_worker_demarshall (arg=0x82b8740) at daemon_core.cpp:3468
#7  0x0810fb97 in CondorThreads::pool_add (routine=0x80caee0 <DaemonCore::CallSocketHandler_worker_demarshall(void*)>, 
    arg=0x82b8740, tid=0x81e1e2c, descrip=0x82025a8 "Handler for Mgmt Methods.") at condor_threads.cpp:1106
#8  0x080c37e7 in DaemonCore::CallSocketHandler (this=0x81dddf0, i=@0xbfffe83c, default_to_HandleCommand=true)
    at daemon_core.cpp:3456
#9  0x080c5ea1 in DaemonCore::Driver (this=0x81dddf0) at daemon_core.cpp:3347
#10 0x080d24e5 in main (argc=1, argv=0xbfffee5c) at daemon_core_main.cpp:2295

The problem is believed to stem from the job/submission decrements that were removed in order to keep job totals correct. However, those decrements also acted on the submission's job-state std::sets, so skipping them leaves stale Job pointers behind in those sets.

(This implementation needs reworking)
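A minimal pure-Python sketch of the suspected failure mode follows: the submission keeps direct references to Job objects in per-state sets; Reset destroys the jobs, but because the decrement that would have pruned the sets was dropped, a later GetJobSummaries touches a destroyed object. The class and field names are illustrative stand-ins for the C++ types, and a raised exception models the SEGV.

```python
# Sketch of the dangling-reference pattern; illustrative names only.
class Job:
    def __init__(self, key):
        self.key = key
        self.destroyed = False

    def summary(self):
        if self.destroyed:
            # models the dereference of a freed Job* (the SEGV in frame #0)
            raise RuntimeError("use after free: %s" % self.key)
        return {"key": self.key}

class SubmissionObject:
    def __init__(self):
        self.idle = set()   # per-state job set, analogous to std::set<Job*>

    def get_job_summaries(self):
        return [j.summary() for j in self.idle]

sub = SubmissionObject()
job = Job("829.1")
sub.idle.add(job)

# Reset destroys the job, but the decrement that would have removed it
# from sub.idle was skipped, leaving a stale reference in the set.
job.destroyed = True
```

Calling `sub.get_job_summaries()` now blows up on the stale entry, mirroring the crash in SubmissionObject::GetJobSummaries.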

Comment 1 Pete MacKinnon 2010-12-29 19:32:37 UTC
FH sha d96a25fba in V7_4-QMF-branch

Unit testing of 10 submissions with random async Resets 
shows no SEGVs on parallel QMF job server queries.

There is still "catch up" that occurs since we wipe out 
all jobs, clusters and submission objects on a Reset. 
Jobs already in the JQL do not trigger a NewClassAd call 
from the JQL reader. Thus, we have to wait for the 
history jobs to coalesce. The infrastructure is in place 
to make this more seamless but needs further discussion.

Resets can be triggered using "pkill -10 condor_job_serv".

jobinventory.py jobserver:jobs@<your JS name>
submissions.py jobserver:jobs@<your JS name>

can be used to do the QMF queries.
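At the signal level, the `pkill -10` trigger above just delivers SIGUSR1 to the job server process. The same mechanism can be exercised from Python; in this sketch the process signals itself purely to show how delivery works (the real test targets the condor_job_server PID instead).

```python
# Demonstrate the SIGUSR1 Reset trigger mechanism by signaling ourselves.
import os
import signal

got = []
signal.signal(signal.SIGUSR1, lambda signum, frame: got.append(signum))

# Equivalent in effect to `pkill -10 <name>`, but aimed at our own PID.
os.kill(os.getpid(), signal.SIGUSR1)
```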

Comment 2 Pete MacKinnon 2011-01-06 16:28:59 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause - Condor issues a reset of the job queue log file on a daily timer or if there was a problem probing the file.

Consequence - A QMF query for job submission information could lead to a SEGV of the job server after the reset.

Fix - Significant refactoring of job server internals to ensure no interface usage of deleted C++ objects.

Result - Job server no longer SEGV after a reset followed by QMF queries.
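One way to read the fix described above is: interfaces stop holding direct references to objects that Reset may delete, and instead resolve keys against the authoritative table at call time, so a stale key yields nothing rather than a dangling dereference. This is a hedged Python sketch of that design direction with illustrative names; the actual fix is C++ refactoring inside the job server.

```python
# Sketch of the "resolve by key at call time" pattern; illustrative only.
jobs = {}   # authoritative key -> job-ad table

def reset():
    jobs.clear()   # Reset wipes everything; repopulation follows later

def get_job_summaries(keys):
    # lazy resolution: keys deleted by a Reset are simply skipped
    return [jobs[k] for k in keys if k in jobs]

jobs["19.0"] = {"state": "IDLE"}
keys = list(jobs)   # the interface remembers keys, not object pointers
reset()
stale_result = get_job_summaries(keys)   # empty, not a crash
```

The design trade-off is an extra lookup per query in exchange for making a Reset-then-query sequence harmless by construction.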

Comment 4 Pete MacKinnon 2011-01-12 13:58:16 UTC
Created attachment 473024 [details]
Python test script to scan submissions from QMF

Reset can be triggered by sending SIGUSR1 (10) to job server process

Comment 5 Pete MacKinnon 2011-01-12 14:12:18 UTC
Created attachment 473031 [details]
Py script to match condor jobs to jobs listed in QMF

Comment 6 Pete MacKinnon 2011-01-12 14:14:57 UTC
Created attachment 473033 [details]
Script to print out details of every grid QMF object on a broker

Comment 7 Pete MacKinnon 2011-01-13 15:03:53 UTC
Tomas, can you try to repro with condor-qmf-7.4.4-0.16 instead of -17? There were interim code changes in -17 that may be masking (but not correcting) the original bug.

Comment 9 Pete MacKinnon 2011-01-18 17:47:16 UTC
QUEUE_CLEAN_INTERVAL was set to 5, determining the Reset frequency.

Comment 10 Tomas Rusnak 2011-01-19 10:13:03 UTC
Reproduced on:

$CondorVersion: 7.4.4 Sep 27 2010 BuildID: RH-7.4.4-0.16.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

Config:
JOB_SERVER_LOG=/var/log/condor/jobserver.log
QMF_PUBLISH_SUBMISSIONS=FALSE
USE_PROCD = False
COLLECTOR_QUERY_WORKERS = 0
SCHEDD_QUERY_WORKERS = 0
SCHEDD_INTERVAL = 5
HISTORY_INTERVAL = 30
QUEUE_CLEAN_INTERVAL = 5
ALL_DEBUG=D_FULLDEBUG

# cat rhelprog.sub 
Universe        = vanilla
Executable      = /bin/sleep
Arguments       = $RANDOM_INTEGER(5,30,1)
should_transfer_files = false
Queue  $RANDOM_INTEGER(1,5,1)
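The submit file above leans on the `$RANDOM_INTEGER(min, max, step)` macro twice: once for the sleep duration and once for the number of queued jobs. My reading of the macro (hedged, not a definitive specification) is a uniform pick from min, min+step, ... up to max, which a Python equivalent makes concrete:

```python
# Hedged Python equivalent of Condor's $RANDOM_INTEGER(min, max, step) macro.
import random

def random_integer(lo, hi, step=1):
    # pick uniformly from lo, lo+step, ..., not exceeding hi
    return random.choice(range(lo, hi + 1, step))

sleep_secs = random_integer(5, 30, 1)   # Arguments = $RANDOM_INTEGER(5,30,1)
n_queued = random_integer(1, 5, 1)      # Queue $RANDOM_INTEGER(1,5,1)
```

With durations of 5-30 seconds against a 5-second QUEUE_CLEAN_INTERVAL, most jobs straddle at least one Reset, which is what makes this reproducer effective.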

Run in loop:
python submissions.py jobserver:jobs@localhost

JobServer.log:
1/19/11 05:06:06 LiveJob destroyed: key '829.1'
01/19/11 05:06:06 LiveJob destroyed: key '830.1'
01/19/11 05:06:06 LiveJob destroyed: key '830.2'
Stack dump for process 15451 at timestamp 1295431566 (15 frames)
condor_job_server(dprintf_dump_stack+0x44)[0x80df9c4]
condor_job_server[0x80e1804]
[0x455420]
condor_job_server(_ZNSt8_Rb_treeIPK3JobS2_St9_IdentityIS2_E6cmpjobSaIS2_EE5eraseERKS2_+0x28)[0x80b6eb8]
condor_job_server(_ZN16SubmissionObject9DecrementEPK3Job+0x128)[0x80b6568]
condor_job_server(_ZN3Job19DecrementSubmissionEv+0x13)[0x80b15e3]
condor_job_server(_ZN7LiveJobD0Ev+0x2f)[0x80b161f]
condor_job_server(_ZN23JobServerJobLogConsumer5ResetEv+0x54)[0x80b71a4]
condor_job_server(_ZN12JobLogReader8BulkLoadEv+0x21)[0x8110701]
condor_job_server(_ZN12JobLogReader4PollEv+0xa9)[0x81107c9]
condor_job_server(_ZN12TimerManager7TimeoutEv+0x14b)[0x80def2b]
condor_job_server(_ZN10DaemonCore6DriverEv+0x244)[0x80c7784]
condor_job_server(main+0xd80)[0x80db320]
/lib/libc.so.6(__libc_start_main+0xdc)[0x553e9c]
condor_job_server[0x80a2ed1]

Comment 12 Tomas Rusnak 2011-01-19 14:24:29 UTC
Retested on:

$CondorVersion: 7.4.5 Jan  4 2011 BuildID: RH-7.4.5-0.6.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

$CondorVersion: 7.4.5 Jan  4 2011 BuildID: RH-7.4.5-0.6.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

$CondorVersion: 7.4.5 Jan  4 2011 BuildID: RH-7.4.5-0.6.el4 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL4 $

$CondorVersion: 7.4.5 Jan  4 2011 BuildID: RH-7.4.5-0.6.el4 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL4 $

Config:
JOB_SERVER_LOG=/var/log/condor/jobserver.log
QMF_PUBLISH_SUBMISSIONS=FALSE
USE_PROCD = False
COLLECTOR_QUERY_WORKERS = 0
SCHEDD_QUERY_WORKERS = 0
SCHEDD_INTERVAL = 5
HISTORY_INTERVAL = 30
QUEUE_CLEAN_INTERVAL = 5
ALL_DEBUG=D_FULLDEBUG

JobServer.log:
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.0'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.0'
01/19/11 09:19:04 Job::~Job of '19.1'
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.1'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.1'
01/19/11 09:19:04 Job::~Job of '19.2'
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.2'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.2'
01/19/11 09:19:04 Job::~Job of '19.3'
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.3'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.3'
01/19/11 09:19:04 Job::~Job of '19.4'
01/19/11 09:19:04 SubmissionObject::Decrement 'IDLE' on '19.4'
01/19/11 09:19:04 LiveJobImpl destroyed: key '19.4'
MaxLog = 1048576, length = 0
Saving log file to "/var/log/condor/jobserver.log.old"

No regression found. Tested with 20 and 40 random jobs while submissions.py ran concurrently in a loop.

>>> VERIFIED

Comment 13 Martin Prpič 2011-02-08 17:24:03 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,7 +1 @@
-Cause - Condor issues a reset of the job queue log file on a daily timer or if there was a problem probing the file.
+Prior to this update, Condor issued a reset of the job queue log file on a daily basis or if there was a problem probing the file. As a result, after the reset, a QMF (Quality Management Framework) query for job submission information could lead to a segmentation fault at the job server. With this update, significant refactoring of the job server internals ensure that no interface uses any deleted C++ objects, thus, the job server no longer returns a segmentation fault after a reset followed by QMF queries.-
-Consequence - A QMF query for job submission information could lead to a SEGV of the job server after the reset.
-
-Fix - Significant refactoring of job server internals to ensure no interface usage of deleted C++ objects.
-
-Result - Job server no longer SEGV after a reset followed by QMF queries.

Comment 14 errata-xmlrpc 2011-02-15 12:14:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html

