Bug 489557
Summary: Crash in QMF Management Agent (c++) during connection shutdown
Product: Red Hat Enterprise MRG
Reporter: Ted Ross <tross>
Component: qpid-qmf
Assignee: Ken Giusti <kgiusti>
Status: CLOSED ERRATA
QA Contact: Frantisek Reznicek <freznice>
Severity: high
Priority: urgent
Version: 1.1
CC: esammons, gsim, iboverma, jneedle, kgiusti, lans.carstensen, lbrindle, matt, mcressma, tao
Target Milestone: 1.2
Keywords: Reopened
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: A locked mutex was being destroyed during connection shutdown
C: The MgmtMasterPlugin on condor_master crashes
F: The QMF Agent was never being notified of the shutdown, and would continue running after the connection - and its resources - was released. Eventually, the Agent thread would attempt to access the released connection's resources - in this case a Mutex that had been freed - which would result in the crash. This behaviour has been fixed by adding new code to the Agent which receives notification when the connection is released, and allows the Agent to remove the connection from its internal database.
R: On connection shutdown, a QMF Agent will properly release all resources associated with the connection. The Agent will no longer attempt to access "stale" connection data - such as the mutex - which would cause a crash.
A locked mutex was being destroyed during connection shutdown, which caused the MgmtMasterPlugin on condor_master to crash. New code has been added to the agent, which allows it to remove the connection from its internal database. QMF agents now properly release all resources associated with the connection, preventing the crash.
Story Points: ---
Clones: 528015 (view as bug list)
Last Closed: 2009-12-03 09:17:12 UTC
Regression: ---
Bug Blocks: 527551
Description
Ted Ross
2009-03-10 17:24:03 UTC
Stack dump for process 5462 at timestamp 1236614911 (13 frames):

condor_master(dprintf_dump_stack+0xc0)[0x4a44bb]
condor_master[0x4a478e]
/lib64/libc.so.6[0x3d080301b0]
/lib64/libc.so.6(gsignal+0x35)[0x3d08030155]
/lib64/libc.so.6(abort+0x110)[0x3d08031bf0]
/lib64/libc.so.6(__assert_fail+0xf6)[0x3d080295d6]
/usr/lib64/condor/plugins/MgmtMasterPlugin-plugin.so(_ZN4qpid3sys5MutexD1Ev+0x55)[0x2aaaaaaeae5d]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client10DispatcherD1Ev+0x13a)[0x35d9e755fa]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client19SubscriptionManagerD0Ev+0x3c)[0x35d9ea0acc]
/usr/lib64/libqmfagent.so.0(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread3runEv+0xa3f)[0x35da61368f]
/usr/lib64/libqpidcommon.so.0[0x2aaaaae8b1ca]
/lib64/libpthread.so.0[0x3d08c062f7]
/lib64/libc.so.6(clone+0x6d)[0x3d080d1b6d]

Version info:
qpidc-0.4.732838-1.el5
condor-qmf-plugins-7.2.0-3.el5 (x86_64)

I may have reproduced a similar issue on shutdown of the qmf-plugin for schedd. I'm attaching logs and cores from an HA schedd system where all that was done was 'service condor start' and 'service condor stop', and that resulted in a SEGV. I can't get a good backtrace into qpidc from the debuginfo RPM on the beta site; I get CRC errors. I'll be attaching the core and all condor logs.

Versions:
qpidc-0.5.752581-5.el5
condor-qmf-plugins-7.2.2-0.9.el5

From SchedLog:
5/5 11:03:46 (pid:316) About to rotate ClassAd log /usr/home/condor/ha-sched/job_queue.log
5/5 11:04:12 (pid:316) Got SIGQUIT. Performing fast shutdown.
5/5 11:04:12 (pid:316) All shadows have been killed, exiting.
5/5 11:04:12 (pid:316) **** condor_schedd (condor_SCHEDD) pid 316 EXITING WITH STATUS 0

Stack dump for process 316 at timestamp 1241546652 (13 frames):

condor_schedd(dprintf_dump_stack+0xb7)[0x5614ef]
condor_schedd[0x56175e]
/lib64/libc.so.6[0x312aa301b0]
condor_schedd(_ZN10DaemonCore6getpidEv+0xc)[0x4d02e4]
condor_schedd(_Z12unix_sigchldi+0x17)[0x556723]
/lib64/libc.so.6[0x312aa301b0]
/lib64/libc.so.6(epoll_wait+0x58)[0x312aad1f58]
/usr/lib64/libqpidcommon.so.0(_ZN4qpid3sys6Poller4waitENS0_8DurationE+0x15d)[0x3da4f749ad]
/usr/lib64/libqpidcommon.so.0(_ZN4qpid3sys6Poller3runEv+0x37)[0x3da4f75787]
/usr/lib64/libqpidclient.so.0(_ZN4qpid6client12TCPConnector3runEv+0x16b)[0x3da5465edb]
/usr/lib64/libqpidcommon.so.0[0x3da4f6c76a]
/lib64/libpthread.so.0[0x312b6062f7]
/lib64/libc.so.6(clone+0x6d)[0x312aad1b6d]

Created attachment 342675 [details]
Full core from condor_schedd SEGV in qpidc client
Created attachment 342676 [details]
Condor log files
Event posted on 08-18-2009 06:12pm EDT by cwyse:
Has the information provided given us any clues as to what caused this? Has it happened since? Are we even still using the same versions of software that we were when it crashed?
This event sent from IssueTracker by cwyse, issue 312002

Could you please specify in what configuration it was tested, and the steps to reproduce?

Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Pulled in the fixes for the crash described by Ted Ross:
http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=5ec65ce2cf2e77e17a1c38edabb0a88b035b4087
http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=44cd54d5deca06b21a27548eead1bfa060148dce

The bug from comment #3 is distinct and has now been given its own BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=528015

Steps I used to reproduce the crash:
1) Enable core dumps (ulimit -c unlimited).
2) Build the qmf-agent example from cpp/examples/qmf-agent.
3) Run the agent, but give it a port number that is *NOT* bound to a broker (you want the connection to fail). Example:
   $ LD_LIBRARY_PATH="../../src/.libs" ./qmf-agent 127.0.0.1 99999
   (Note: I had to set the library search path in order to find the qmf dynamic link libraries; your mileage may vary.)
4) Hit ^C.
5) Crash!

Usually takes a few tries to reproduce; I've had to do it up to 20 times before getting the crash. Waiting a few seconds between steps 3 and 4 seems to increase the likelihood.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)

The issue has been fixed, validated on RHEL 4.8 / 5.4 i386 / x86_64 on (source) packages:
qpid*-0.5.752581-30.el5
No more segfaults were observed after a long run.
-> VERIFIED

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1 +1,10 @@
-Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
+Grid bug fix
+
+C: A locked mutex was being destroyed during connection shutdown
+C: The MgmtMasterPlugin on condor_master crashes
+F:
+R:
+
+Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
+
+MORE INFORMATION REQUIRED FOR RELNOTE.

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -2,8 +2,8 @@
 C: A locked mutex was being destroyed during connection shutdown
 C: The MgmtMasterPlugin on condor_master crashes
-F:
-R:
+F: The QMF Agent was never being notified of the shutdown, and would continue running after the connection - and its resources - was released. Eventually, the Agent thread would attempt to access the released connection's resources - in this case a Mutex that had been freed - which would result in the crash. This behaviour has been fixed by adding new code to the Agent which receives notification when the connection is released, and allows the Agent to remove the connection from its internal database.
+R: On connection shutdown, a QMF Agent will properly release all resources associated with the connection. The Agent will no longer attempt to access "stale" connection data - such as the mutex - which would cause a crash.

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -5,6 +5,5 @@
 F: The QMF Agent was never being notified of the shutdown, and would continue running after the connection - and its resources - was released. Eventually, the Agent thread would attempt to access the released connection's resources - in this case a Mutex that had been freed - which would result in the crash. This behaviour has been fixed by adding new code to the Agent which receives notification when the connection is released, and allows the Agent to remove the connection from its internal database.
 R: On connection shutdown, a QMF Agent will properly release all resources associated with the connection. The Agent will no longer attempt to access "stale" connection data - such as the mutex - which would cause a crash.
-Resolved crash in QMF Management Agent (c++) during connection shutdown (489557)
-MORE INFORMATION REQUIRED FOR RELNOTE.
+A locked mutex was being destroyed during connection shutdown, which caused the MgmtMasterPlugin on condor_master to crash. New code has been added to the agent, which allows it to remove the connection from its internal database. QMF agents now properly release all resources associated with the connection, preventing it from crashing.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHEA-2009-1633.html