Bug 877553
Summary: Crash traced to generated QMF code

Product: Red Hat Enterprise MRG
Component: qpid-qmf
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Version: Development
Target Milestone: 2.3
Target Release: ---
Hardware: Unspecified
OS: Linux
Reporter: Pete MacKinnon <pmackinn>
Assignee: Ken Giusti <kgiusti>
QA Contact: Frantisek Reznicek <freznice>
CC: esammons, freznice, iboverma, jross, lzhaldyb, matt, mcressma, tross, tstclair
Keywords: OtherQA, Regression
Fixed In Version: qpid-cpp-0.18-12
Doc Type: Bug Fix
Doc Text:
    Cause: A rarely used internal data structure was modified by one thread without being locked against access by other threads.
    Consequence: The data structure could become corrupted, eventually causing the application to crash.
    Fix: A lock was introduced so the data structure can be modified safely.
    Result: The internal data structure is no longer corrupted.
Last Closed: 2013-03-06 18:52:57 UTC
Type: Bug
Description
Pete MacKinnon
2012-11-16 20:59:37 UTC
Observations: 'threadStats' (the pointer) has a value of 0x400000001, which is likely corrupt. 'idx' is zero, which is reasonable (especially if this is a single-threaded app). The object constructor allocates perThreadStatsArray and initializes all of the pointers to zero.

Questions: Could this call have been made on a Submission object that had not been constructed? It would be interesting to see a gdb dump of the Submission object. Are the properties reasonable looking?

    (gdb) p *job->m_submission
    $22 = {<qpid::management::Manageable> = {<No data fields>}, m_idle = std::set with 0 elements, m_running = std::set with 0 elements, m_removed = std::set with 0 elements, m_completed = std::set with 23 elements = {[0] = 0x102d8d20, [1] = 0x719935c0, [2] = 0x430d10a0, [3] = 0x7dcaa00, [4] = 0x499c6550, [5] = 0x15da7470, [6] = 0x3cf49dd0, [7] = 0x2e952ec0, [8] = 0x240ae2c0, [9] = 0x11cf7410, [10] = 0x2f58e490, [11] = 0x422639e0, [12] = 0x203c8ca0, [13] = 0x2ee7b1b0, [14] = 0x10f4c840, [15] = 0x15758d30, [16] = 0x355ed2c0, [17] = 0x584e9300, [18] = 0x2c4ffd50, [19] = 0x67b098e0, [20] = 0x2c0c3ad0, [21] = 0x33457710, [22] = 0x102d8ea0}, m_held = std::set with 0 elements, m_transferring_output = std::set with 0 elements, m_suspended = std::set with 0 elements, ownerSet = true, mgmtObject = 0x344530f0}

    (gdb) p *job->m_submission->mgmtObject
    $23 = {<qpid::management::ManagementObject> = {<No data fields>}, static packageName = "com.redhat.grid", static className = "submission", static md5Sum = "\275\355\"#\241\ac\217\220\356]\367\345'i\207", jobserverRef = {agent = 0x62fd18, first = 49, second = 7020036704541700711, agentEpoch = 7309956218500558434, v2Key = <unprintable: Cannot access memory at address 0x6d6f632e7461684c>, agentName = <unprintable: Cannot access memory at address 0x3231393132360b>}, Name = "", Owner = <unprintable: Cannot access memory at address 0x9>, QDate = 3615880547319296566, perThreadStatsArray = 0xfd4200, static METHOD_ECHO = 1, static METHOD_GETJOBSUMMARIES = 2}

(The v2Key, agentName, and Owner strings could not be printed; the libstdc++ pretty-printer in /usr/share/gdb/python/libstdcxx/v6/printers.py raised "Cannot access memory" RuntimeErrors for each, another sign the object is corrupt.)

    (gdb) p threadStats
    $1 = (qmf::com::redhat::grid::Submission::PerThreadStats *) 0x400000001
    (gdb) p *threadStats
    Cannot access memory at address 0x400000001
    (gdb) p $1
    $2 = (qmf::com::redhat::grid::Submission::PerThreadStats *) 0x400000001
    (gdb) p idx
    $3 = 0
    (gdb) p perThreadStatsArray
    $4 = (qmf::com::redhat::grid::Submission::PerThreadStats **) 0xfd4200
    (gdb) p *perThreadStatsArray
    $5 = (qmf::com::redhat::grid::Submission::PerThreadStats *) 0x400000001

Created attachment 649269 [details]
Some further analysis of crash
With help from Pete MacKinnon, got another crash. See the attachment for details.

As Ted pointed out, the mgmtObject (a Submission class) is filled with garbage. It has either been overwritten, or referenced after being freed. My guess is that it has been freed (yes, biased - but hear me out). From what little I understand of the condor code, I think the parent object (SubmissionObject) that owns the mgmtObject marks the mgmtObject as deleted in its destructor. As far as I can tell, the SubmissionObject is allocated by a Job, but is held in a global map called g_submissions. The Job doesn't seem to delete the SubmissionObject; rather, I assume the g_submissions map claims ownership. From what I can tell, the SubmissionObject is not in the g_submissions map at the point of the crash, indicating that it may have already been freed. Hard to tell, since the g_submissions map has well over a quarter million entries at the time of the crash, but the suspect index seems to be pointing elsewhere.

One potential debug hack I'd like to try would be to zero out the pointer to the mgmtObject in the SubmissionObject's destructor (after calling resourceDestroy()). If the crash signature changes to a null dereference, it would likely point to use after free.

As an aside: I ran condor_job_server under valgrind to see if it would catch such usage. It did find some use-after-free errors, but none that seemed directly related (see attached).

    for (SubmissionCollectionType::iterator i = g_submissions.begin();
         g_submissions.end() != i; i++) {
        delete (*i).second;
        g_submissions.erase(i);
    }

    SubmissionObject::~SubmissionObject()
    {
        dprintf(D_FULLDEBUG,
                "SubmissionObject::~SubmissionObject for '%s'\n",
                mgmtObject->get_Name().c_str());
        if (mgmtObject) {
            mgmtObject->resourceDestroy();
        }
    }

Same code used with 0.14. The mgmtObject is never deleted and only marked for resourceDestroy when we walk through and clean up the map above.
Posting this in case anyone recognizes these patterns:
I've added some "guard arrays" to the Submission object, patterned with a known value (0x1D) at object construction. I've instrumented several assert() checks to try to catch when these guards are overwritten.
In Submission.h:

    // Properties
    char guard1[16];
    ::qpid::management::ObjectId jobserverRef;
    char guard2[16];
    std::string Name;
    char guard3[16];
    std::string Owner;
    char guard4[16];
    int64_t QDate;
    char guard5[16];
    // Statistics
    // Per-Thread Statistics
    char guard6[16];  // guard6 follows the statistics members (elided here)

    void CHECK() const {
        assert(guard1[0] == 0x1D);
        assert(guard2[0] == 0x1D);
        assert(guard3[0] == 0x1D);
        assert(guard4[0] == 0x1D);
        assert(guard5[0] == 0x1D);
        assert(guard6[0] == 0x1D);
    }
I've just hit an assert - essentially the same point in the code as previous failures (Completed counter increment), so nothing new there:
(gdb) bt
#0  0x00000035642328a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003564234085 in abort () at abort.c:92
#2  0x000000356422ba1e in __assert_fail_base (fmt=<value optimized out>, assertion=0x42a763 "guard1[0] == 0x1D", file=0x42a710 "/root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h", line=<value optimized out>, function=<value optimized out>) at assert.c:96
#3  0x000000356422bae0 in __assert_fail (assertion=0x42a763 "guard1[0] == 0x1D", file=0x42a710 "/root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h", line=69, function=0x42aa00 "void qmf::com::redhat::grid::Submission::CHECK() const") at assert.c:105
#4  0x000000000041a27d in qmf::com::redhat::grid::Submission::CHECK (this=<value optimized out>) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:69
#5  0x000000000041be1e in getThreadStats (this=0xa96660e0, job=0x8ee37360) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:93
#6  inc_Completed (this=0xa96660e0, job=0x8ee37360) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:231
#7  SubmissionObject::Increment (this=0xa96660e0, job=0x8ee37360) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/SubmissionObject.cpp:103
#8  0x000000000041ebd3 in Job::SetSubmission (this=0x8ee37360, _subName=<value optimized out>, cluster=593087) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Job.cpp:663
#9  0x000000000041efaa in Job::SetImpl (this=0x8ee37360, hji=0x8ee372a0) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Job.cpp:582
#10 0x0000000000424263 in process (entry=<value optimized out>) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/HistoryProcessingUtils.cpp:94
#11 0x000000000042458d in ProcessHistoryDirectory () at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/HistoryProcessingUtils.cpp:158
#12 0x0000000000424fc8 in process_history_files () at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/HistoryProcessingUtils.cpp:67
#13 0x0000003df1f7f78a in ClassAdLogReader::BulkLoad (this=0x6de5f8) at /usr/src/debug/condor-7.8.6/src/condor_utils/ClassAdLogReader.cpp:109
#14 0x0000003df1f7f8a8 in ClassAdLogReader::Poll (this=0x6de5f8) at /usr/src/debug/condor-7.8.6/src/condor_utils/ClassAdLogReader.cpp:83
#15 0x0000003df1fa25a0 in JobLogMirror::TimerHandler_JobLogPolling (this=0x6de5f0) at /usr/src/debug/condor-7.8.6/src/condor_utils/JobLogMirror.cpp:74
#16 0x0000003df20478c1 in TimerManager::Timeout (this=0x3df234dda0, pNumFired=0x7fffffffdb98, pruntime=0x7fffffffdb90) at /usr/src/debug/condor-7.8.6/src/condor_daemon_core.V6/timer_manager.cpp:428
#17 0x0000003df2055083 in DaemonCore::Driver (this=0x648690) at /usr/src/debug/condor-7.8.6/src/condor_daemon_core.V6/daemon_core.cpp:3152
#18 0x0000003df20445d0 in dc_main (argc=1, argv=0x7fffffffe2b0) at /usr/src/debug/condor-7.8.6/src/condor_daemon_core.V6/daemon_core_main.cpp:2410
#19 0x000000356421ecdd in __libc_start_main (main=0x40dcc0 <main(int, char**)>, argc=2, ubp_av=0x7fffffffe2a8, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe298) at libc-start.c:226
#20 0x000000000040d819 in _start ()
However, parts of the guard arrays have been overwritten with what looks like ASCII data:

(gdb) p /x mgmtObject->guard1
$21 = {0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0x0}
(gdb) p /x mgmtObject->guard2
$22 = {0x0, 0x0, 0x0, 0x0, 0x2e, 0x63, 0x6f, 0x6d, 0x6d, 0x66, 0x61, 0x72, 0x72, 0x65, 0x6c, 0x6c}
(gdb) p /s &mgmtObject->guard2[4]
$23 = 0x81b52164 ".commfarrell"
(gdb) p /x mgmtObject->guard3
$24 = {0x31, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) p /x mgmtObject->guard4
$25 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x35, 0x33, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) p /x mgmtObject->guard5
$26 = {0x51, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2c, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) p /x mgmtObject->guard6
$27 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x67, 0x72, 0x69, 0x64, 0x30, 0x2e, 0x6c, 0x61}
(gdb) p /s &mgmtObject->guard6[8]
$28 = 0x81b521c8 "grid0.lab.bos.redhat.com#936319.6#1353614605"
Do these strings seem familiar to anyone? Are they related to Jobs or Submissions?
Yeah, those values are properties set on the wrapper that are assigned to the contained mgmtObject.

I've found a point in the code where a lock was missed. Ran overnight with this fix - no crashes. See upstream: https://issues.apache.org/jira/browse/QPID-4485 and the commit: http://svn.apache.org/viewvc?view=revision&revision=1416037

Hello Pete, I'm currently trying hard to reproduce the issue with a focused reproducer. As it is very difficult to trigger, I'd also like your (testing) feedback on Ken's patch applied in qpid-cpp-0.18-12. Do you still see occasional crashes, or not anymore, while using qpid-cpp-0.18-12 or later?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0561.html