Bug 489557 - Crash in QMF Management Agent (c++) during connection shutdown
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-qmf
Version: 1.1
Hardware: All Linux
Priority: urgent, Severity: high
Target Milestone: 1.2
Target Release: ---
Assigned To: Ken Giusti
QA Contact: Frantisek Reznicek
Keywords: Reopened
Depends On:
Blocks: 527551
 
Reported: 2009-03-10 13:24 EDT by Ted Ross
Modified: 2015-11-15 19:07 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: A locked mutex was being destroyed during connection shutdown.
C: The MgmtMasterPlugin on condor_master crashes.
F: The QMF Agent was never notified of the shutdown and would continue running after the connection, and its resources, had been released. Eventually, the Agent thread would attempt to access the released connection's resources (in this case a Mutex that had been freed), which would result in the crash. This behaviour has been fixed by adding new code to the Agent that receives notification when the connection is released and allows the Agent to remove the connection from its internal database.
R: On connection shutdown, a QMF Agent will properly release all resources associated with the connection. The Agent will no longer attempt to access "stale" connection data, such as the mutex, which would cause a crash.
A locked mutex was being destroyed during connection shutdown, which caused the MgmtMasterPlugin on condor_master to crash. New code has been added to the agent, which allows it to remove the connection from its internal database. QMF agents now properly release all resources associated with the connection, preventing the crash.
Story Points: ---
Clone Of:
Clones: 528015
Environment:
Last Closed: 2009-12-03 04:17:12 EST


Attachments
Full core from condor_schedd SEGV in qpidc client (540.53 KB, application/x-bzip2)
2009-05-06 11:21 EDT, Lans Carstensen
Condor log files (7.70 KB, application/x-compressed-tar)
2009-05-06 11:21 EDT, Lans Carstensen

Description Ted Ross 2009-03-10 13:24:03 EDT
Description of problem:

A crash has been seen in the MgmtMasterPlugin on condor_master, possibly caused by destroying a locked mutex during connection shutdown.  The stack trace is supplied below.

Version-Release number of selected component (if applicable):

1.1 and later

How reproducible:

Unknown, this is a one-time occurrence.

Steps to Reproduce:

Unknown
Comment 1 Ted Ross 2009-03-10 13:24:39 EDT
Stack dump for process 5462 at timestamp 1236614911 (13 frames)
condor_master(dprintf_dump_stack+0xc0)[0x4a44bb]
condor_master[0x4a478e]
/lib64/libc.so.6[0x3d080301b0]
/lib64/libc.so.6(gsignal+0x35)[0x3d08030155]
/lib64/libc.so.6(abort+0x110)[0x3d08031bf0]
/lib64/libc.so.6(__assert_fail+0xf6)[0x3d080295d6]
/usr/lib64/condor/plugins/MgmtMasterPlugin-plugin.so(_ZN4qpid3sys5MutexD1Ev+0x55)[0x2aaaaaaeae5d]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client10DispatcherD1Ev+0x13a)[0x35d9e755fa]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client19SubscriptionManagerD0Ev+0x3c)[0x35d9ea0acc]
/usr/lib64/libqmfagent.so.0(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread3runEv+0xa3f)[0x35da61368f]
/usr/lib64/libqpidcommon.so.0[0x2aaaaae8b1ca]
/lib64/libpthread.so.0[0x3d08c062f7]
/lib64/libc.so.6(clone+0x6d)[0x3d080d1b6d]
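The mangled C++ symbols in the trace above can be decoded with a tool such as c++filt, or programmatically. The following is a minimal sketch, assuming a GCC or Clang toolchain that provides <cxxabi.h>; the helper name demangle is made up for illustration and is not part of qpid or condor.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (Itanium C++ ABI, GCC/Clang)

#include <cstdlib>   // std::free
#include <string>

// Hypothetical helper: return the readable form of a mangled symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the buffer was allocated with malloc by __cxa_demangle
    return result;
}
```

Applied to the qpid frames, _ZN4qpid3sys5MutexD1Ev demangles to qpid::sys::Mutex::~Mutex(), and the frames beneath it are the destructors of qpid::client::Dispatcher and qpid::client::SubscriptionManager, called from qpid::management::ManagementAgentImpl::ConnectionThread::run(). In other words, the assertion fired inside the Mutex destructor while the agent's connection thread was tearing down its SubscriptionManager.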
Comment 2 Ted Ross 2009-03-10 16:49:24 EDT
Version info:

qpidc-0.4.732838-1.el5
condor-qmf-plugins-7.2.0-3.el5

(x86_64)
Comment 3 Lans Carstensen 2009-05-06 11:19:46 EDT
I may have reproduced a similar issue on shutdown of the qmf-plugin for schedd.  I'm attaching logs and cores from an HA schedd system where all that was done was 'service condor start' and 'service condor stop' and that resulted in a SEGV.  I can't get a good backtrace into qpidc from the debuginfo RPM on the beta site - I get CRC errors.

I'll be attaching the core and all condor logs.

Versions:
qpidc-0.5.752581-5.el5
condor-qmf-plugins-7.2.2-0.9.el5

From SchedLog:

5/5 11:03:46 (pid:316) About to rotate ClassAd log /usr/home/condor/ha-sched/job_queue.log
5/5 11:04:12 (pid:316) Got SIGQUIT.  Performing fast shutdown.
5/5 11:04:12 (pid:316) All shadows have been killed, exiting.
Stack dump for process 316 at timestamp 1241546652 (13 frames)
5/5 11:04:12 (pid:316) **** condor_schedd (condor_SCHEDD) pid 316 EXITING WITH STATUS 0
condor_schedd(dprintf_dump_stack+0xb7)[0x5614ef]
condor_schedd[0x56175e]
/lib64/libc.so.6[0x312aa301b0]
condor_schedd(_ZN10DaemonCore6getpidEv+0xc)[0x4d02e4]
condor_schedd(_Z12unix_sigchldi+0x17)[0x556723]
/lib64/libc.so.6[0x312aa301b0]
/lib64/libc.so.6(epoll_wait+0x58)[0x312aad1f58]
/usr/lib64/libqpidcommon.so.0(_ZN4qpid3sys6Poller4waitENS0_8DurationE+0x15d)[0x3da4f749ad]
/usr/lib64/libqpidcommon.so.0(_ZN4qpid3sys6Poller3runEv+0x37)[0x3da4f75787]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client12TCPConnector3runEv+0x16b)[0x3da5465edb]
/usr/lib64/libqpidcommon.so.0[0x3da4f6c76a]
/lib64/libpthread.so.0[0x312b6062f7]
/lib64/libc.so.6(clone+0x6d)[0x312aad1b6d]
Comment 4 Lans Carstensen 2009-05-06 11:21:03 EDT
Created attachment 342675
Full core from condor_schedd SEGV in qpidc client
Comment 5 Lans Carstensen 2009-05-06 11:21:50 EDT
Created attachment 342676
Condor log files
Comment 6 Issue Tracker 2009-08-18 18:12:36 EDT
Event posted on 08-18-2009 06:12pm EDT by cwyse

Has the information provided given us any clues as to what caused this?
Has it happened since?  Are we even still using the same versions of
software that we were when it crashed?


This event sent from IssueTracker by cwyse 
 issue 312002
Comment 7 Frantisek Reznicek 2009-09-29 12:24:16 EDT
Could you possibly specify in what configuration it was tested and steps to reproduce, please?
Comment 8 RHEL Product and Program Management 2009-09-29 13:00:34 EDT
Quality Engineering Management has reviewed and declined this request.  You may
appeal this decision by reopening this request.
Comment 12 Gordon Sim 2009-10-09 03:08:34 EDT
The bug from comment #3 is distinct and has now been given its own BZ: https://bugzilla.redhat.com/show_bug.cgi?id=528015
Comment 13 Ken Giusti 2009-10-09 09:57:29 EDT
Steps I used to repro the crash:

1) enable core dumps (ulimit -c unlimited)
2) build the qmf-agent example from cpp/examples/qmf-agent
3) run the agent, but give it a port # that is *NOT* bound to a broker (you want the connection to fail).  Example:

 $ LD_LIBRARY_PATH="../../src/.libs" ./qmf-agent 127.0.0.1 99999

(note: I had to set the library search path in order to find the qmf dynamic link libs... ymmv).

4) Hit ^C....
5) Crash!

Usually takes a few tries to repro - I've had to do it up to 20 times before getting the crash.  Waiting a few seconds between steps 3 and 4 seems to increase the likelihood.
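Since the crash needs several attempts, the start/interrupt cycle in steps 3 and 4 could be automated. The following is only a sketch for a POSIX system: RunResult, isCrash and runAndInterrupt are hypothetical helper names, and treating SIGSEGV or SIGABRT as "the crash" is an assumption based on the traces in this report.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Outcome of one start/interrupt cycle (hypothetical helper type).
struct RunResult {
    bool signaled;  // true if the child was terminated by a signal
    int signal;     // terminating signal, valid when signaled is true
    int exitCode;   // exit status, valid when signaled is false
};

// Assumption: a run counts as a crash if the child died from SIGSEGV or SIGABRT.
bool isCrash(const RunResult& r) {
    return r.signaled && (r.signal == SIGSEGV || r.signal == SIGABRT);
}

// Start the command, wait for `delay`, send SIGINT (the ^C of step 4),
// then report how the child terminated.
RunResult runAndInterrupt(const std::vector<std::string>& args,
                          std::chrono::milliseconds delay) {
    if (args.empty()) return RunResult{false, 0, -1};

    std::vector<char*> argv;
    for (const std::string& a : args) {
        argv.push_back(const_cast<char*>(a.c_str()));
    }
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) return RunResult{false, 0, -1};
    if (pid == 0) {
        signal(SIGINT, SIG_DFL);  // do not inherit an ignored SIGINT
        execvp(argv[0], argv.data());
        _exit(127);               // exec failed
    }

    std::this_thread::sleep_for(delay);
    kill(pid, SIGINT);

    int status = 0;
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {
        // retry if the wait itself was interrupted
    }
    if (WIFSIGNALED(status)) return RunResult{true, WTERMSIG(status), 0};
    return RunResult{false, 0, WIFEXITED(status) ? WEXITSTATUS(status) : -1};
}
```

A driver would call runAndInterrupt with the qmf-agent command line from step 3 (with LD_LIBRARY_PATH already set in the environment) and a delay of a few seconds, repeating up to 20 times or so and stopping at the first run for which isCrash returns true.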
Comment 15 Irina Boverman 2009-10-28 13:02:41 EDT
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
Comment 16 Frantisek Reznicek 2009-10-30 04:39:47 EDT
The issue has been fixed, validated on RHEL 4.8 / 5.4 i386 / x86_64 on (source) packages: qpid*-0.5.752581-30.el5

No more segfault observed after long run.

-> VERIFIED
Comment 17 Lana Brindley 2009-11-11 16:47:32 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,10 @@
-Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
+Grid bug fix
+
+C: A locked mutex was being destroyed during connection shutdown
+C: The MgmtMasterPlugin on condor_master crashes
+F:
+R:
+
+Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
+
+MORE INFORMATION REQUIRED FOR RELNOTE.
Comment 18 Ken Giusti 2009-11-13 08:32:41 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -2,8 +2,8 @@
 
 C: A locked mutex was being destroyed during connection shutdown
 C: The MgmtMasterPlugin on condor_master crashes
-F:
-R:
+F: The QMF Agent was never being notified of the shutdown, and would continue running after the connection - and its resources - was released.  Eventually, the Agent thread would attempt to access the released connection's resources - in this case a Mutex that had been freed - which would result in the crash.  This behaviour has be fixed by adding new code to the Agent which receives notification when the connection is released, and allows the Agent to remove the connection from its internal database.
+R: On connection shut down, a QMF Agent will properly release all resources associated with the connection.  The Agent will no longer attempt to access "stale" connection data - such as the mutex - which would cause a crash.
 
 Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
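The fix described in the F: and R: entries above (notify the agent when a connection is released, and have it drop the connection from its internal database) can be illustrated with a small standalone sketch. This is not the qpid code: Connection, AgentRegistry and onConnectionClosed are hypothetical names, and std::shared_ptr with std::mutex stand in for whatever the real agent uses.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a client connection and the resources it owns.
struct Connection {
    explicit Connection(std::string n) : name(std::move(n)) {}
    std::string name;
    std::mutex lock;       // resource that must not be touched after release
    int messagesSent = 0;
};

// Hypothetical agent-side database of live connections.
class AgentRegistry {
public:
    // Called when a connection is established.
    void add(int id, std::shared_ptr<Connection> conn) {
        std::lock_guard<std::mutex> guard(registryLock_);
        connections_[id] = std::move(conn);
    }

    // Notification from the connection layer that a connection was released.
    // Afterwards the agent can no longer look that connection up.
    void onConnectionClosed(int id) {
        std::lock_guard<std::mutex> guard(registryLock_);
        connections_.erase(id);
    }

    // Called from the agent thread. Returns false instead of touching
    // stale state when the connection has already been removed.
    bool send(int id) {
        std::shared_ptr<Connection> conn;
        {
            std::lock_guard<std::mutex> guard(registryLock_);
            auto it = connections_.find(id);
            if (it == connections_.end()) return false;
            conn = it->second;  // keeps the object alive while it is in use
        }
        std::lock_guard<std::mutex> guard(conn->lock);
        ++conn->messagesSent;
        return true;
    }

    // Number of connections the agent still knows about.
    std::size_t size() const {
        std::lock_guard<std::mutex> guard(registryLock_);
        return connections_.size();
    }

private:
    mutable std::mutex registryLock_;
    std::map<int, std::shared_ptr<Connection>> connections_;
};
```

The point of the design is that the agent thread only reaches a connection's mutex through the registry, so once onConnectionClosed has run, send fails cleanly rather than locking a mutex that belongs to a released connection.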
Comment 19 Lana Brindley 2009-11-18 23:33:12 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -5,6 +5,5 @@
 F: The QMF Agent was never being notified of the shutdown, and would continue running after the connection - and its resources - was released.  Eventually, the Agent thread would attempt to access the released connection's resources - in this case a Mutex that had been freed - which would result in the crash.  This behaviour has be fixed by adding new code to the Agent which receives notification when the connection is released, and allows the Agent to remove the connection from its internal database.
 R: On connection shut down, a QMF Agent will properly release all resources associated with the connection.  The Agent will no longer attempt to access "stale" connection data - such as the mutex - which would cause a crash.
 
-Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
 
-MORE INFORMATION REQUIRED FOR RELNOTE.
+A locked mutex was being destroyed during connection shutdown, which caused the MgmtMasterPlugin on condor_master to crash. New code has been added to the agent, which allows it to remove the connection from its internal database. QMF agents now properly release all resources associated with the connection, preventing it from crashing.
Comment 21 errata-xmlrpc 2009-12-03 04:17:12 EST
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html
