Bug 489557

Summary:

Crash in QMF Management Agent (c++) during connection shutdown

Product:

Red Hat Enterprise MRG

Reporter:

Ted Ross <tross>

Component:

qpid-qmf

Assignee:

Ken Giusti <kgiusti>

Status:

CLOSED ERRATA

QA Contact:

Frantisek Reznicek <freznice>

Severity:

high

Priority:

urgent

Version:

1.1

CC:

esammons, gsim, iboverma, jneedle, kgiusti, lans.carstensen, lbrindle, matt, mcressma, tao

Target Milestone:

1.2

Keywords:

Reopened

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Doc Text:

Grid bug fix

C: A locked mutex was being destroyed during connection shutdown.
C: The MgmtMasterPlugin on condor_master crashes.
F: The QMF Agent was never notified of the shutdown and would continue running after the connection, and its resources, had been released. Eventually, the Agent thread would attempt to access the released connection's resources (in this case a Mutex that had been freed), which would result in the crash. This behaviour has been fixed by adding new code to the Agent which receives notification when the connection is released, and allows the Agent to remove the connection from its internal database.
R: On connection shutdown, a QMF Agent will properly release all resources associated with the connection. The Agent will no longer attempt to access "stale" connection data, such as the mutex, which would cause a crash.

A locked mutex was being destroyed during connection shutdown, which caused the MgmtMasterPlugin on condor_master to crash. New code has been added to the agent, which allows it to remove the connection from its internal database. QMF agents now properly release all resources associated with the connection, preventing the crash.

Clones:

528015

Last Closed:

2009-12-03 09:17:12 UTC

Bug Blocks:

527551

Attachments:

Description	Flags
Full core from condor_schedd SEGV in qpidc client	none
Condor log files	none

Description Ted Ross 2009-03-10 17:24:03 UTC

Description of problem:

A crash has been seen in the MgmtMasterPlugin on condor_master, possibly caused by destroying a locked mutex during connection shutdown. The stack trace is supplied below.

Version-Release number of selected component (if applicable):

1.1 and later

How reproducible:

Unknown; this is a one-time occurrence.

Steps to Reproduce:

Unknown

Comment 1 Ted Ross 2009-03-10 17:24:39 UTC

Stack dump for process 5462 at timestamp 1236614911 (13 frames)
condor_master(dprintf_dump_stack+0xc0)[0x4a44bb]
condor_master[0x4a478e]
/lib64/libc.so.6[0x3d080301b0]
/lib64/libc.so.6(gsignal+0x35)[0x3d08030155]
/lib64/libc.so.6(abort+0x110)[0x3d08031bf0]
/lib64/libc.so.6(__assert_fail+0xf6)[0x3d080295d6]
/usr/lib64/condor/plugins/MgmtMasterPlugin-plugin.so(_ZN4qpid3sys5MutexD1Ev+0x55)[0x2aaaaaaeae5d]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client10DispatcherD1Ev+0x13a)[0x35d9e755fa]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client19SubscriptionManagerD0Ev+0x3c)[0x35d9ea0acc]
/usr/lib64/libqmfagent.so.0(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread3runEv+0xa3f)[0x35da61368f]
/usr/lib64/libqpidcommon.so.0[0x2aaaaae8b1ca]
/lib64/libpthread.so.0[0x3d08c062f7]
/lib64/libc.so.6(clone+0x6d)[0x3d080d1b6d]
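
For readers following the trace, the mangled C++ frames can be decoded with the Itanium ABI demangler. The sketch below is an editorial aid, not part of the original report. It assumes a GCC or Clang toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>`, and that each frame has been joined onto a single line. The helper names `symbolOf` and `demangle` are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Extract the symbol name between '(' and '+' in a frame such as
// "lib.so(_ZN4qpid3sys5MutexD1Ev+0x55)[0x2aaaaaaeae5d]".
// Returns an empty string when the frame carries no symbol.
std::string symbolOf(const std::string& frame) {
    std::string::size_type open = frame.find('(');
    if (open == std::string::npos) return "";
    std::string::size_type plus = frame.find('+', open);
    if (plus == std::string::npos) return "";
    return frame.substr(open + 1, plus - open - 1);
}

// Demangle an Itanium-ABI symbol. Names that are not mangled
// (for example plain C functions) are returned unchanged.
std::string demangle(const std::string& symbol) {
    if (symbol.compare(0, 2, "_Z") != 0) return symbol;
    int status = 0;
    char* text = abi::__cxa_demangle(symbol.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) return symbol;
    std::string result(text);
    std::free(text);  // __cxa_demangle allocates the buffer with malloc
    return result;
}
```

Applied to the frames above, this gives `qpid::sys::Mutex::~Mutex()` asserting inside `qpid::client::Dispatcher::~Dispatcher()`, reached from `qpid::client::SubscriptionManager::~SubscriptionManager()` and `qpid::management::ManagementAgentImpl::ConnectionThread::run()`.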

Comment 2 Ted Ross 2009-03-10 20:49:24 UTC

Version info:

qpidc-0.4.732838-1.el5
condor-qmf-plugins-7.2.0-3.el5

(x86_64)

Comment 3 Lans Carstensen 2009-05-06 15:19:46 UTC

I may have reproduced a similar issue on shutdown of the qmf-plugin for schedd.  I'm attaching logs and cores from an HA schedd system where all that was done was 'service condor start' and 'service condor stop' and that resulted in a SEGV.  I can't get a good backtrace into qpidc from the debuginfo RPM on the beta site - I get CRC errors.

I'll be attaching the core and all condor logs.

Versions:
qpidc-0.5.752581-5.el5
condor-qmf-plugins-7.2.2-0.9.el5

From SchedLog:

5/5 11:03:46 (pid:316) About to rotate ClassAd log /usr/home/condor/ha-sched/job_queue.log
5/5 11:04:12 (pid:316) Got SIGQUIT.  Performing fast shutdown.
5/5 11:04:12 (pid:316) All shadows have been killed, exiting.
Stack dump for process 316 at timestamp 1241546652 (13 frames)
5/5 11:04:12 (pid:316) **** condor_schedd (condor_SCHEDD) pid 316 EXITING WITH STATUS 0
condor_schedd(dprintf_dump_stack+0xb7)[0x5614ef]
condor_schedd[0x56175e]
/lib64/libc.so.6[0x312aa301b0]
condor_schedd(_ZN10DaemonCore6getpidEv+0xc)[0x4d02e4]
condor_schedd(_Z12unix_sigchldi+0x17)[0x556723]
/lib64/libc.so.6[0x312aa301b0]
/lib64/libc.so.6(epoll_wait+0x58)[0x312aad1f58]
/usr/lib64/libqpidcommon.so.0(_ZN4qpid3sys6Poller4waitENS0_8DurationE+0x15d)[0x3da4f749ad]
/usr/lib64/libqpidcommon.so.0(_ZN4qpid3sys6Poller3runEv+0x37)[0x3da4f75787]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client12TCPConnector3runEv+0x16b)[0x3da5465edb]
/usr/lib64/libqpidcommon.so.0[0x3da4f6c76a]
/lib64/libpthread.so.0[0x312b6062f7]
/lib64/libc.so.6(clone+0x6d)[0x312aad1b6d]

Comment 4 Lans Carstensen 2009-05-06 15:21:03 UTC

Created attachment 342675 [details]
Full core from condor_schedd SEGV in qpidc client

Comment 5 Lans Carstensen 2009-05-06 15:21:50 UTC

Created attachment 342676 [details]
Condor log files

Comment 6 Issue Tracker 2009-08-18 22:12:36 UTC

Event posted on 08-18-2009 06:12pm EDT by cwyse

Has the information provided given us any clues as to what caused this?
Has it happened since?  Are we even still using the same versions of
software that we were when it crashed?


This event sent from IssueTracker by cwyse 
 issue 312002

Comment 7 Frantisek Reznicek 2009-09-29 16:24:16 UTC

Could you please specify the configuration in which it was tested and the steps to reproduce?

Comment 8 RHEL Program Management 2009-09-29 17:00:34 UTC

Quality Engineering Management has reviewed and declined this request.  You may
appeal this decision by reopening this request.

Comment 11 Ken Giusti 2009-10-08 22:35:06 UTC

Pulled in the fixes for the crash described by Ted Ross:


http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=5ec65ce2cf2e77e17a1c38edabb0a88b035b4087

http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=44cd54d5deca06b21a27548eead1bfa060148dce

Comment 12 Gordon Sim 2009-10-09 07:08:34 UTC

The bug from comment #3 is distinct and has now been given its own BZ: https://bugzilla.redhat.com/show_bug.cgi?id=528015

Comment 13 Ken Giusti 2009-10-09 13:57:29 UTC

Steps I used to repro the crash:

1) enable core dumps (ulimit -c unlimited)
2) build the qmf-agent example from cpp/examples/qmf-agent
3) run the agent, but give it a port # that is *NOT* bound to a broker (you want the connection to fail).  Example:

 $ LD_LIBRARY_PATH="../../src/.libs" ./qmf-agent 127.0.0.1 99999

(note: I had to set the library search path in order to find the qmf dynamic link libs... ymmv).

4) Hit ^C....
5) Crash!

Usually takes a few tries to repro - I've had to do it up to 20 times before getting the crash.  Waiting a few seconds between steps 3 and 4 seems to increase the likelihood.
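
The steps above interrupt the process while a background thread is still retrying a connection that cannot succeed. As a general illustration of how that kind of shutdown race is usually avoided (not the actual qpid patch), the sketch below stops and joins the worker thread before any of the state it uses is destroyed. `RetryWorker` and all of its members are hypothetical names, and the 10 ms retry interval is arbitrary.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical worker that keeps "retrying a connection" until told to stop.
class RetryWorker {
public:
    RetryWorker() : stop_(false), attempts_(0), thread_([this] { run(); }) {}

    // Join the thread before the mutex and condition variable are destroyed.
    ~RetryWorker() { shutdown(); }

    // Ask the worker to stop and wait for it to finish. Safe to call twice.
    void shutdown() {
        {
            std::lock_guard<std::mutex> guard(lock_);
            stop_ = true;
        }
        wake_.notify_all();
        if (thread_.joinable()) thread_.join();
    }

    int attempts() const { return attempts_.load(); }

private:
    void run() {
        std::unique_lock<std::mutex> guard(lock_);
        while (!stop_) {
            ++attempts_;  // stands in for one failed connection attempt
            // Sleep between retries, waking early if shutdown is requested.
            wake_.wait_for(guard, std::chrono::milliseconds(10),
                           [this] { return stop_; });
        }
    }

    std::mutex lock_;
    std::condition_variable wake_;
    bool stop_;
    std::atomic<int> attempts_;
    std::thread thread_;  // declared last so it starts after the other members exist
};
```

Because `shutdown()` runs in the destructor body, the worker has exited and released `lock_` before the mutex itself is destroyed. On toolchains that need it, build with `-pthread`.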

Comment 15 Irina Boverman 2009-10-28 17:02:41 UTC

Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)

Comment 16 Frantisek Reznicek 2009-10-30 08:39:47 UTC

The issue has been fixed, validated on RHEL 4.8 / 5.4 i386 / x86_64 on (source) packages: qpid*-0.5.752581-30.el5

No more segfault observed after long run.

-> VERIFIED

Comment 17 Lana Brindley 2009-11-11 21:47:32 UTC

Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,10 @@
-Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
+Grid bug fix
+
+C: A locked mutex was being destroyed during connection shutdown
+C: The MgmtMasterPlugin on condor_master crashes
+F:
+R:
+
+Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
+
+MORE INFORMATION REQUIRED FOR RELNOTE.

Comment 18 Ken Giusti 2009-11-13 13:32:41 UTC

Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -2,8 +2,8 @@
 
 C: A locked mutex was being destroyed during connection shutdown
 C: The MgmtMasterPlugin on condor_master crashes
-F:
-R:
+F: The QMF Agent was never being notified of the shutdown, and would continue running after the connection - and its resources - was released.  Eventually, the Agent thread would attempt to access the released connection's resources - in this case a Mutex that had been freed - which would result in the crash.  This behaviour has be fixed by adding new code to the Agent which receives notification when the connection is released, and allows the Agent to remove the connection from its internal database.
+R: On connection shut down, a QMF Agent will properly release all resources associated with the connection.  The Agent will no longer attempt to access "stale" connection data - such as the mutex - which would cause a crash.
 
 Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
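
The F: and R: text in this release note describes the Agent being notified when a connection is released and removing it from its internal database. A minimal sketch of that idea follows. It is not the code from the commits referenced in comment 11: `Agent`, `Connection`, `onConnectionClosed` and `publish` are hypothetical names, and the use of `std::shared_ptr` to keep a connection alive while it is in use is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for a connection and the resources it owns.
struct Connection {
    std::mutex lock;       // the kind of resource that went stale in the crash
    int messagesSent = 0;
};

class Agent {
public:
    void addConnection(const std::string& id) {
        std::lock_guard<std::mutex> guard(registryLock_);
        connections_[id] = std::make_shared<Connection>();
    }

    // Shutdown notification: drop the connection from the agent's database
    // so later agent work cannot reach its released resources.
    void onConnectionClosed(const std::string& id) {
        std::lock_guard<std::mutex> guard(registryLock_);
        connections_.erase(id);
    }

    // Agent-thread work. Returns false when the connection is already gone.
    bool publish(const std::string& id) {
        std::shared_ptr<Connection> conn;
        {
            std::lock_guard<std::mutex> guard(registryLock_);
            auto it = connections_.find(id);
            if (it == connections_.end()) return false;
            conn = it->second;  // keeps the connection alive while in use
        }
        // guard is released before conn, so the mutex is never destroyed locked.
        std::lock_guard<std::mutex> guard(conn->lock);
        ++conn->messagesSent;
        return true;
    }

    std::size_t connectionCount() const {
        std::lock_guard<std::mutex> guard(registryLock_);
        return connections_.size();
    }

private:
    mutable std::mutex registryLock_;
    std::map<std::string, std::shared_ptr<Connection>> connections_;
};
```

After `onConnectionClosed("broker-1")`, a later `publish("broker-1")` returns false instead of touching freed state.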

Comment 19 Lana Brindley 2009-11-19 04:33:12 UTC

Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -5,6 +5,5 @@
 F: The QMF Agent was never being notified of the shutdown, and would continue running after the connection - and its resources - was released.  Eventually, the Agent thread would attempt to access the released connection's resources - in this case a Mutex that had been freed - which would result in the crash.  This behaviour has be fixed by adding new code to the Agent which receives notification when the connection is released, and allows the Agent to remove the connection from its internal database.
 R: On connection shut down, a QMF Agent will properly release all resources associated with the connection.  The Agent will no longer attempt to access "stale" connection data - such as the mutex - which would cause a crash.
 
-Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
 
-MORE INFORMATION REQUIRED FOR RELNOTE.
+A locked mutex was being destroyed during connection shutdown, which caused the MgmtMasterPlugin on condor_master to crash. New code has been added to the agent, which allows it to remove the connection from its internal database. QMF agents now properly release all resources associated with the connection, preventing it from crashing.

Comment 21 errata-xmlrpc 2009-12-03 09:17:12 UTC

An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html