Bug 877553

Summary: Crash traced to generated QMF code
Product: Red Hat Enterprise MRG
Component: qpid-qmf
Version: Development
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Pete MacKinnon <pmackinn>
Assignee: Ken Giusti <kgiusti>
QA Contact: Frantisek Reznicek <freznice>
CC: esammons, freznice, iboverma, jross, lzhaldyb, matt, mcressma, tross, tstclair
Target Milestone: 2.3
Keywords: OtherQA, Regression
Hardware: Unspecified
OS: Linux
Fixed In Version: qpid-cpp-0.18-12
Doc Type: Bug Fix
Doc Text:
Cause: A rarely used internal data structure was being modified by a thread without being locked against other threads.
Consequence: The data structure could become corrupted, eventually causing the application to crash.
Fix: A lock was introduced to allow the data structure to be modified safely.
Result: The internal data structure is no longer corrupted.
Last Closed: 2013-03-06 18:52:57 UTC
Type: Bug

Attachments: Some further analysis of crash (attachment 649269)

Description Pete MacKinnon 2012-11-16 20:59:37 UTC
Seeing repeated crashes in the condor_job_server which incorporates generated QMF code for a Submission class that uses statistics.

  <class name="Submission">
  <!-- snip -->
    <statistic name="Idle" type="count32"/>
    <statistic name="Running" type="count32"/>
    <statistic name="Removed" type="count32"/>
    <statistic name="Completed" type="count32"/>
    <statistic name="Held" type="count32"/>
    <statistic name="TransferringOutput" type="count32"/>
    <statistic name="Suspended" type="count32"/>

The application code worked fine prior to 0.18.


[New Thread 0x7ffff7f70700 (LWP 4342)]
[New Thread 0x7ffff756f700 (LWP 4343)]
[New Thread 0x7ffff6acd700 (LWP 4353)]
[New Thread 0x7ffff60cc700 (LWP 4354)]

Program received signal SIGSEGV, Segmentation fault.
0x0000000000416327 in getThreadStats (this=<value optimized out>, job=0x7391a480) at /usr/src/debug/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:80
80              if (threadStats == 0) {
(gdb) where
#0  0x0000000000416327 in getThreadStats (this=<value optimized out>, job=0x7391a480) at /usr/src/debug/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:80
#1  dec_Completed (this=<value optimized out>, job=0x7391a480) at /usr/src/debug/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:200
#2  SubmissionObject::Decrement (this=<value optimized out>, job=0x7391a480) at /usr/src/debug/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/SubmissionObject.cpp:145
#3  0x0000000000410af6 in Job::~Job (this=0x7391a480, __in_chrg=<value optimized out>) at /usr/src/debug/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Job.cpp:538
#4  0x000000000041b8ff in global_reset () at /usr/src/debug/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Globals.cpp:32
#5  0x000000000040fef3 in process_history_files () at /usr/src/debug/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/HistoryProcessingUtils.cpp:65
#6  0x000000377c77f47a in ClassAdLogReader::BulkLoad (this=0x8af818) at /usr/src/debug/condor-7.8.6/src/condor_utils/ClassAdLogReader.cpp:109
#7  0x00000000008b08a0 in ?? ()
#8  0x000000377c77f598 in ClassAdLogReader::Poll (this=0x8ae790) at /usr/src/debug/condor-7.8.6/src/condor_utils/ClassAdLogReader.cpp:83
#9  0x0000000050a677f5 in ?? ()
#10 0x00000000008af810 in ?? ()
#11 0x00007fffffffdc80 in ?? ()
#12 0x000000377cb4d7c0 in ?? () from /usr/lib64/libcondor_utils_7_8_7.so
#13 0x000000377cb4bf10 in StartdIntegerKeywords () from /usr/lib64/libcondor_utils_7_8_7.so
#14 0x000000377c7a22f0 in JobLogMirror::TimerHandler_JobLogPolling (this=0x8af818) at /usr/src/debug/condor-7.8.6/src/condor_utils/JobLogMirror.cpp:74
#15 0x000000377cb4d7e0 in ?? () from /usr/lib64/libcondor_utils_7_8_7.so
#16 0x000000377c847611 in TimerManager::Timeout (this=0x8af818, pNumFired=0x7fffffffdc88, pruntime=0x8af818)
    at /usr/src/debug/condor-7.8.6/src/condor_daemon_core.V6/timer_manager.cpp:428
#17 0x00007fffffffd990 in ?? ()
#18 0x00000001f7ffe814 in ?? ()
#19 0x00007fffffffd9a0 in ?? ()
#20 0x00007fffffffd998 in ?? ()
#21 0x00007fffffffd9b0 in ?? ()
#22 0x00007ffff7ffe8bc in gettimeofday ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) list
75          struct PerThreadStats** perThreadStatsArray;
76
77          inline struct PerThreadStats* getThreadStats() {
78              int idx = getThreadIndex();
79              struct PerThreadStats* threadStats = perThreadStatsArray[idx];
80              if (threadStats == 0) {
81                  threadStats = new(PerThreadStats);
82                  perThreadStatsArray[idx] = threadStats;
83                  threadStats->Idle = 0;
84                  threadStats->Running = 0;
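
For context on the listing above: within the snippet itself there is no locking around perThreadStatsArray; the slot is read, allocated on first use, and stored back in separate steps. A minimal, self-contained sketch of that pattern (invented names and sizes, not the generated QMF code) shows how concurrent callers can race on a slot and leave pointers and counters in an unpredictable state:

// race_sketch.cpp -- illustrative only, not the generated QMF code.
// Build: g++ -std=c++11 -pthread race_sketch.cpp
#include <cstdio>
#include <thread>
#include <vector>

struct PerThreadStats { long Completed; };

static const int SLOTS = 4;
static PerThreadStats* slots[SLOTS] = { 0 };   // zeroed up front, like perThreadStatsArray

// Same shape as the lazy init above: the read of slots[idx], the allocation,
// and the store back are not protected by any lock, so two threads can both
// observe 0, both allocate, and later readers see whichever store won.
PerThreadStats* getStats(int idx) {
    PerThreadStats* s = slots[idx];
    if (s == 0) {
        s = new PerThreadStats();
        s->Completed = 0;
        slots[idx] = s;
    }
    return s;
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t)
        workers.push_back(std::thread([] {
            for (int i = 0; i < 100000; ++i)
                getStats(i % SLOTS)->Completed++;   // unsynchronized counter updates
        }));
    for (size_t i = 0; i < workers.size(); ++i)
        workers[i].join();

    long total = 0;
    for (int i = 0; i < SLOTS; ++i)
        total += slots[i]->Completed;
    // With the races above, the total typically comes up short of 800000,
    // and a ThreadSanitizer build reports the conflicting accesses.
    std::printf("total = %ld (expected %d)\n", total, 8 * 100000);
    return 0;
}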

Comment 1 Ted Ross 2012-11-16 22:01:23 UTC
Observations:

'threadStats' (the pointer) has a value of 0x400000001 which is likely corrupt.  'idx' is zero, which is reasonable (especially if this is a single-threaded app).

The object constructor allocates perThreadStatsArray and initializes all of the pointers to zero.


Questions:

Could this call have been made on a Submission object that had not been constructed?  It would be interesting to see a gdb dump of the Submission object.  Are the properties reasonable looking?

Comment 2 Pete MacKinnon 2012-11-17 00:47:44 UTC
(gdb) p *job->m_submission
$22 = {<qpid::management::Manageable> = {<No data fields>}, m_idle = std::set with 0 elements, m_running = std::set with 0 elements, m_removed = std::set with 0 elements, 
  m_completed = std::set with 23 elements = {[0] = 0x102d8d20, [1] = 0x719935c0, [2] = 0x430d10a0, [3] = 0x7dcaa00, [4] = 0x499c6550, [5] = 0x15da7470, [6] = 0x3cf49dd0, 
    [7] = 0x2e952ec0, [8] = 0x240ae2c0, [9] = 0x11cf7410, [10] = 0x2f58e490, [11] = 0x422639e0, [12] = 0x203c8ca0, [13] = 0x2ee7b1b0, [14] = 0x10f4c840, [15] = 0x15758d30, 
    [16] = 0x355ed2c0, [17] = 0x584e9300, [18] = 0x2c4ffd50, [19] = 0x67b098e0, [20] = 0x2c0c3ad0, [21] = 0x33457710, [22] = 0x102d8ea0}, m_held = std::set with 0 elements, 
  m_transferring_output = std::set with 0 elements, m_suspended = std::set with 0 elements, ownerSet = true, mgmtObject = 0x344530f0}
(gdb) p *job->m_submission->mgmtObject
$23 = {<qpid::management::ManagementObject> = {<No data fields>}, static packageName = "com.redhat.grid", static className = "submission", 
  static md5Sum = "\275\355\"#\241\ac\217\220\356]\367\345'i\207", jobserverRef = {agent = 0x62fd18, first = 49, second = 7020036704541700711, agentEpoch = 7309956218500558434, 
    v2Key = Traceback (most recent call last):
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 558, in to_string
    return self.val['_M_dataplus']['_M_p'].lazy_string (length = len)
RuntimeError: Cannot access memory at address 0x6d6f632e7461684c
, agentName = Traceback (most recent call last):
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 558, in to_string
    return self.val['_M_dataplus']['_M_p'].lazy_string (length = len)
RuntimeError: Cannot access memory at address 0x3231393132360b
}, Name = "", Owner = Traceback (most recent call last):
  File "/usr/lib64/../share/gdb/python/libstdcxx/v6/printers.py", line 558, in to_string
    return self.val['_M_dataplus']['_M_p'].lazy_string (length = len)
RuntimeError: Cannot access memory at address 0x9
, QDate = 3615880547319296566, perThreadStatsArray = 0xfd4200, static METHOD_ECHO = 1, static METHOD_GETJOBSUMMARIES = 2}

(gdb) p threadStats
$1 = (qmf::com::redhat::grid::Submission::PerThreadStats *) 0x400000001
(gdb) p *threadStats
Cannot access memory at address 0x400000001
(gdb) p $1
$2 = (qmf::com::redhat::grid::Submission::PerThreadStats *) 0x400000001
(gdb) p idx
$3 = 0
(gdb) p perThreadStatsArray
$4 = (qmf::com::redhat::grid::Submission::PerThreadStats **) 0xfd4200
(gdb) p *perThreadStatsArray
$5 = (qmf::com::redhat::grid::Submission::PerThreadStats *) 0x400000001

Comment 3 Ken Giusti 2012-11-21 14:40:27 UTC
Created attachment 649269 [details]
Some further analysis of crash

Comment 4 Ken Giusti 2012-11-21 14:49:24 UTC
With help from Pete MacKinnon, got another crash.

See attachment for details.

As Ted pointed out, the mgmtObject (an instance of the Submission class) is filled with garbage.  It has either been overwritten or referenced after being freed.

My guess is that it has been freed (yeah, biased - but hear me out).  From what little I understand of the condor code, I think the parent object (SubmissionObject) that owns the mgmtObject marks the mgmtObject as deleted in its destructor.

As far as I can tell, the SubmissionObject is allocated by a Job but is held in a global map called g_submissions.  The Job doesn't seem to delete the SubmissionObject; rather, I assume the g_submissions map claims ownership.

From what I can tell, the SubmissionObject is not in the g_submissions map at the point of the crash, indicating that it may have already been freed.

Hard to tell, since the g_submissions map has well over 1/4 million entries at the time of the crash, but the suspect index seems to be pointing elsewhere.

One potential debug hack I'd like to try: zero out the pointer to the mgmtObject in the SubmissionObject's destructor (after calling resourceDestroy()).  If the crash signature changes to a null dereference, it would likely point to use-after-free.
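
A minimal sketch of that hack, applied to the SubmissionObject destructor quoted in comment 5 below (illustrative only; the dprintf from the real destructor is omitted):

SubmissionObject::~SubmissionObject()
{
    if ( mgmtObject )
    {
        mgmtObject->resourceDestroy();
        // Debug hack: clear the member so any later use of this (already
        // destroyed) SubmissionObject faults on a null mgmtObject instead of
        // on whatever the freed memory happens to contain.
        mgmtObject = NULL;
    }
}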

As an aside: I ran condor_job_server under valgrind, to see if it would catch such usage.  It did find some use-after-free errors, but none that seemed directly related (see attached).

Comment 5 Pete MacKinnon 2012-11-21 15:04:41 UTC
for (SubmissionCollectionType::iterator i = g_submissions.begin();
        g_submissions.end() != i; i++) {
	delete (*i).second;
	g_submissions.erase(i);
}

SubmissionObject::~SubmissionObject()
{
	dprintf ( D_FULLDEBUG, "SubmissionObject::~SubmissionObject for '%s'\n", mgmtObject->get_Name().c_str());
    if ( mgmtObject )
    {
        mgmtObject->resourceDestroy();
    }
}

The same code was used with 0.14. The mgmtObject is never deleted; it is only marked (via resourceDestroy()) when we walk through and clean up the map above.

Comment 6 Ken Giusti 2012-11-28 18:38:26 UTC
Posting this in case anyone recognizes these patterns:

I've added some "guard arrays" in the Submission object and pattern them to a known value on object construction (0x1D).  I've instrumented several assert() checks to try to catch when these guards are overwritten.

In Submission.h:

    // Properties
    char guard1[16];
    ::qpid::management::ObjectId jobserverRef;
    char guard2[16];
    std::string Name;
    char guard3[16];
    std::string Owner;
    char guard4[16];
    int64_t QDate;
    char guard5[16];

    // Statistics


    // Per-Thread Statistics

    void CHECK() const {
      assert(guard1[0] == 0x1D);
      assert(guard2[0] == 0x1D);
      assert(guard3[0] == 0x1D);
      assert(guard4[0] == 0x1D);
      assert(guard5[0] == 0x1D);
      assert(guard6[0] == 0x1D);
    }
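
(The sixth guard, guard6, checked above and dumped in the gdb output further down, is declared among the members elided from this snippet.) As a hypothetical illustration of how such guards get exercised, each accessor would call CHECK() on entry so the assert fires the first time the object is touched after a guard byte is overwritten; the accessor bodies below are invented, not the actual generated code:

    int64_t get_QDate() const {
        CHECK();               // trips if any guard byte is no longer 0x1D
        return QDate;
    }

    void set_Name(const std::string& name) {
        CHECK();               // object intact before the write...
        Name = name;
        CHECK();               // ...and the write itself did not trample a guard
    }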

I've just hit an assert - essentially the same point in the code as previous failures (Completed counter increment), so nothing new there:


(gdb) bt
#0  0x00000035642328a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003564234085 in abort () at abort.c:92
#2  0x000000356422ba1e in __assert_fail_base (fmt=<value optimized out>, assertion=0x42a763 "guard1[0] == 0x1D", file=0x42a710 "/root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h", line=<value optimized out>, function=<value optimized out>) at assert.c:96
#3  0x000000356422bae0 in __assert_fail (assertion=0x42a763 "guard1[0] == 0x1D", file=0x42a710 "/root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h", line=69, function=0x42aa00 "void qmf::com::redhat::grid::Submission::CHECK() const") at assert.c:105
#4  0x000000000041a27d in qmf::com::redhat::grid::Submission::CHECK (this=<value optimized out>) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:69
#5  0x000000000041be1e in getThreadStats (this=0xa96660e0, job=0x8ee37360) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:93
#6  inc_Completed (this=0xa96660e0, job=0x8ee37360) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Submission.h:231
#7  SubmissionObject::Increment (this=0xa96660e0, job=0x8ee37360) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/SubmissionObject.cpp:103
#8  0x000000000041ebd3 in Job::SetSubmission (this=0x8ee37360, _subName=<value optimized out>, cluster=593087) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Job.cpp:663
#9  0x000000000041efaa in Job::SetImpl (this=0x8ee37360, hji=0x8ee372a0) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/Job.cpp:582
#10 0x0000000000424263 in process (entry=<value optimized out>) at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/HistoryProcessingUtils.cpp:94
#11 0x000000000042458d in ProcessHistoryDirectory () at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/HistoryProcessingUtils.cpp:158
#12 0x0000000000424fc8 in process_history_files () at /root/rpmbuild/BUILD/condor-7.8.6/src/condor_contrib/mgmt/qmf/daemons/HistoryProcessingUtils.cpp:67
#13 0x0000003df1f7f78a in ClassAdLogReader::BulkLoad (this=0x6de5f8) at /usr/src/debug/condor-7.8.6/src/condor_utils/ClassAdLogReader.cpp:109
#14 0x0000003df1f7f8a8 in ClassAdLogReader::Poll (this=0x6de5f8) at /usr/src/debug/condor-7.8.6/src/condor_utils/ClassAdLogReader.cpp:83
#15 0x0000003df1fa25a0 in JobLogMirror::TimerHandler_JobLogPolling (this=0x6de5f0) at /usr/src/debug/condor-7.8.6/src/condor_utils/JobLogMirror.cpp:74
#16 0x0000003df20478c1 in TimerManager::Timeout (this=0x3df234dda0, pNumFired=0x7fffffffdb98, pruntime=0x7fffffffdb90) at /usr/src/debug/condor-7.8.6/src/condor_daemon_core.V6/timer_manager.cpp:428
#17 0x0000003df2055083 in DaemonCore::Driver (this=0x648690) at /usr/src/debug/condor-7.8.6/src/condor_daemon_core.V6/daemon_core.cpp:3152
#18 0x0000003df20445d0 in dc_main (argc=1, argv=0x7fffffffe2b0) at /usr/src/debug/condor-7.8.6/src/condor_daemon_core.V6/daemon_core_main.cpp:2410
#19 0x000000356421ecdd in __libc_start_main (main=0x40dcc0 <main(int, char**)>, argc=2, ubp_av=0x7fffffffe2a8, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe298) at libc-start.c:226
#20 0x000000000040d819 in _start ()


However, parts of the guard arrays have been overwritten with what looks like ASCII data:



(gdb) p /x mgmtObject->guard1
$21 = {0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0x0}
(gdb) p /x mgmtObject->guard2
$22 = {0x0, 0x0, 0x0, 0x0, 0x2e, 0x63, 0x6f, 0x6d, 0x6d, 0x66, 0x61, 0x72, 0x72, 0x65, 0x6c, 0x6c}

(gdb) p /s &mgmtObject->guard2[4]
$23 = 0x81b52164 ".commfarrell"

(gdb) p /x mgmtObject->guard3
$24 = {0x31, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) p /x mgmtObject->guard4
$25 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x35, 0x33, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) p /x mgmtObject->guard5
$26 = {0x51, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2c, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) p /x mgmtObject->guard6
$27 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x67, 0x72, 0x69, 0x64, 0x30, 0x2e, 0x6c, 0x61}

(gdb) p /s &mgmtObject->guard6[8]
$28 = 0x81b521c8 "grid0.lab.bos.redhat.com#936319.6#1353614605"


Do these strings seem familiar to anyone?  Are they related to Jobs or Submissions?

Comment 7 Pete MacKinnon 2012-11-29 16:38:49 UTC
Yeah, those values are properties set on the wrapper that get assigned to the contained mgmtObject.

Comment 8 Ken Giusti 2012-12-01 18:27:24 UTC
I've found a point in the code where a lock was missed.

Ran overnight with this fix - no crashes.

See upstream:

https://issues.apache.org/jira/browse/QPID-4485

And commit:

http://svn.apache.org/viewvc?view=revision&revision=1416037
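
For reference, the fix described in the Doc Text above ("a lock was introduced to allow the data structure to be modified safely") boils down to the pattern sketched below. The names are invented and std::mutex is used for brevity where qpid-cpp code would typically use qpid::sys::Mutex; the actual change is in the commit linked above:

#include <mutex>

struct PerThreadStats { long Completed; };

class StatsArraySketch {
    std::mutex lock;                 // serializes every access to the slot array
    PerThreadStats* slots[8];

public:
    StatsArraySketch() {
        for (int i = 0; i < 8; ++i) slots[i] = 0;
    }

    PerThreadStats* getThreadStats(int idx) {
        // Hold the lock across the whole read-check-allocate-store sequence so
        // two threads can no longer race on the same slot and corrupt it.
        std::lock_guard<std::mutex> guard(lock);
        PerThreadStats* s = slots[idx];
        if (s == 0) {
            s = new PerThreadStats();
            s->Completed = 0;
            slots[idx] = s;
        }
        return s;
    }
};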

Comment 16 Frantisek Reznicek 2013-02-04 11:09:54 UTC
Hello Pete,
I'm currently trying hard to reproduce the issue with a focused reproducer. As it is very difficult to trigger, I'd also like to get your testing feedback on Ken's patch as applied in qpid-cpp-0.18-12. Do you still see occasional crashes while using qpid-cpp-*0.18-12 or newer, or have they stopped?

Comment 20 errata-xmlrpc 2013-03-06 18:52:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0561.html