Bug 610493 - qpidd broker aborts in qpid::sys::Thread::join() / qpid::sys::assertClusterSafe() during 'furious' shutdown
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
: ---
Assignee: Alan Conway
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-07-02 08:52 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:12 UTC (History)

Fixed In Version: 0.10
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-25 10:44:08 UTC
Target Upstream Version:
Embargoed:


Attachments
The issue reproducer including abort backtraces (51.33 KB, application/x-tbz)
2010-07-02 08:54 UTC, Frantisek Reznicek
The issue reproducer including abort backtraces (52.03 KB, application/x-tbz)
2010-07-07 08:23 UTC, Frantisek Reznicek

Description Frantisek Reznicek 2010-07-02 08:52:53 UTC
Description of problem:

The initial testing scenario was to verify proper persistence of queues, exchanges and messages in standalone and clustered environments.

By mistake I forgot to pass the --wait-for-exit=N parameter to the mrg_kill_process_id() function, which is responsible for killing the process; the parameter makes it wait after each signal is sent to the process. This mistake resulted in a 'furious qpidd shutdown', i.e. the following commands being issued in quick succession with a minimal time gap (approx. 0.05s):
kill -2 <qpidd_pid>
kill -15 <qpidd_pid>
kill -9 <qpidd_pid>
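
The sequence above could be scripted roughly as follows. This is only a minimal sketch, not the code from the attached reproducer: the function name furious_kill and its gap argument are hypothetical, and the fractional sleep assumes a sleep implementation that accepts non-integer seconds (e.g. GNU coreutils).

```shell
# Hypothetical helper (not part of the attached reproducer):
# send SIGINT, SIGTERM and SIGKILL to one process with only a short gap
# between them, without waiting for the process to react.
furious_kill() {
    local pid=$1
    local gap=${2:-0.05}   # seconds between signals; ~0.05s as in the report
    local sig
    for sig in INT TERM KILL; do
        # Stop quietly if the process has already disappeared.
        kill -s "$sig" "$pid" 2>/dev/null || return 0
        sleep "$gap"
    done
}

# Usage: furious_kill <qpidd_pid>
```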

Unintentionally, I was able to stress the qpidd shutdown process by making it handle multiple signals within a short period.

I found the RHEL4 x86_64 broker aborting in:
  #0  0x0000003f3002e32d in *__GI_raise (sig=Variable "sig" is not available.
  )
      at ../nptl/sysdeps/unix/sysv/linux/raise.c:67
  #1  0x0000003f3002fb2e in *__GI_abort () at ../sysdeps/generic/abort.c:88
  #2  0x0000003f344b1148 in __gnu_cxx::__verbose_terminate_handler ()
    from /usr/lib64/libstdc++.so.6
  #3  0x0000003f344af176 in __cxa_call_unexpected ()
    from /usr/lib64/libstdc++.so.6
  #4  0x0000003f344af1a3 in std::terminate () from /usr/lib64/libstdc++.so.6
  #5  0x0000003f344af2a3 in __cxa_throw () from /usr/lib64/libstdc++.so.6
  #6  0x0000002a95aed3aa in qpid::sys::Thread::join ()
    from /usr/lib64/libqpidcommon.so.3
  #7  0x0000002a95bda139 in qpid::sys::Timer::stop ()
    from /usr/lib64/libqpidcommon.so.3
  #8  0x0000002a956cb672 in qpid::broker::Broker::~Broker$delete ()
    from /usr/lib64/libqpidbroker.so.3
  #9  0x0000002a957cebb0 in qpid::broker::SignalHandler::shutdownHandler ()
    from /usr/lib64/libqpidbroker.so.3
  #10 <signal handler called>

and the clustered RHEL5 i386 broker aborting in:
  #0  0x001bc410 in __kernel_vsyscall ()
  #1  0x00b6adf0 in raise () from /lib/libc.so.6
  #2  0x00b6c701 in abort () from /lib/libc.so.6
  #3  0x003878e3 in qpid::sys::assertClusterSafe() ()
    from /usr/lib/libqpidcommon.so.3
  #4  0x0097ae1c in qpid::broker::Queue::requeue(qpid::broker::QueuedMessage const&) () from /usr/lib/libqpidbroker.so.3
  #5  0x0091df3f in qpid::broker::DeliveryRecord::requeue() const ()
    from /usr/lib/libqpidbroker.so.3
  #6  0x009b6440 in qpid::broker::SemanticState::recover(bool) ()
    from /usr/lib/libqpidbroker.so.3
  #7  0x009b693b in qpid::broker::SemanticState::closed() ()
    from /usr/lib/libqpidbroker.so.3
  #8  0x009d739e in qpid::broker::SessionState::~SessionState() ()
    from /usr/lib/libqpidbroker.so.3
  #9  0x009d1882 in qpid::broker::SessionHandler::~SessionHandler() ()
    from /usr/lib/libqpidbroker.so.3
  #10 0x009122f4 in qpid::broker::Connection::~Connection() ()
    from /usr/lib/libqpidbroker.so.3
  #11 0x00786d71 in qpid::cluster::Connection::~Connection() ()
    from /usr/lib/qpid/daemon/cluster.so
  #12 0x008f1385 in qpid::RefCounted::released() const ()
    from /usr/lib/libqpidbroker.so.3
  #13 0x0078d49a in qpid::cluster::ConnectionCodec::~ConnectionCodec() ()
    from /usr/lib/qpid/daemon/cluster.so
  #14 0x009a9d36 in qpid::broker::SecureConnection::~SecureConnection() ()
    from /usr/lib/libqpidbroker.so.3
  #15 0x00385e19 in qpid::sys::AsynchIOHandler::~AsynchIOHandler() ()
    from /usr/lib/libqpidcommon.so.3


This kind of shutdown might not be valid, but it is worth looking at the paths above and revising the shutdown process to check for holes there.

Currently put as high/high for 1.3, feel free to modify based on your judgement.

Version-Release number of selected component (if applicable):
python-qpid-0.7.946106-4.el5
qpid-cpp-client-0.7.946106-6.el5
qpid-cpp-client-devel-0.7.946106-6.el5
qpid-cpp-client-devel-docs-0.7.946106-6.el5
qpid-cpp-client-ssl-0.7.946106-6.el5
qpid-cpp-mrg-debuginfo-0.7.946106-6.el5
qpid-cpp-server-0.7.946106-6.el5
qpid-cpp-server-cluster-0.7.946106-6.el5
qpid-cpp-server-devel-0.7.946106-6.el5
qpid-cpp-server-ssl-0.7.946106-6.el5
qpid-cpp-server-store-0.7.946106-6.el5
qpid-cpp-server-xml-0.7.946106-6.el5
qpid-java-client-0.7.946106-5.el5
qpid-java-common-0.7.946106-5.el5
qpid-tools-0.7.946106-6.el5


How reproducible:
hard (~5%)

Steps to Reproduce:
1. rhel4 and rhel5 standalone case:
   :>./run.loG ; while true; do ./run.sh || break; done | tee ./log.loG
   rhel5 clustered case:
   :>./run.loG ; while true; do ./run.sh $((2 + ${RANDOM}%6 )) || break; done | tee run.loG
2. wait for abort
  
Actual results:
The qpidd broker occasionally aborts during shutdown as a result of the 'furious kill'.

Expected results:
This is the question. I'm honestly not sure, but at least for all signals except SIGKILL proper handling is possible (though most probably extremely complex).

Additional info:

Comment 1 Frantisek Reznicek 2010-07-02 08:54:02 UTC
Created attachment 428775 [details]
The issue reproducer including abort backtraces

Comment 3 Frantisek Reznicek 2010-07-07 08:23:59 UTC
Created attachment 429995 [details]
The issue reproducer including abort backtraces

More info on attachment:
The attachment in comment #1 contains two directories, qpidd_abort_01 and qpidd_abort_03, which are two instances of a similar reproducer: one for the RHEL4 x86_64 case (qpidd_abort_03) and one for the RHEL5 i386 case (qpidd_abort_01).

I revisited the attached code and the symlinked code to make them more self-explanatory. I also found that the qpidd_abort_03 reproducer already contained the correction that waits for the program's reaction to a sent signal, so I reverted that correction.

The new attachment replaces the old one.
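
For contrast with the 'furious' sequence, the "wait for the program's reaction" behaviour could look roughly like the sketch below. It is an illustration only and not the actual mrg_kill_process_id() implementation: the function name kill_with_wait and its arguments are hypothetical, and it again assumes a sleep that accepts fractional seconds.

```shell
# Hypothetical helper (NOT the real mrg_kill_process_id): escalate
# SIGINT -> SIGTERM -> SIGKILL, but wait up to wait_secs seconds after
# each signal for the process to exit before sending the next one.
kill_with_wait() {
    local pid=$1
    local wait_secs=${2:-5}
    local sig polls
    for sig in INT TERM KILL; do
        # Nothing to do if the process is already gone.
        kill -s "$sig" "$pid" 2>/dev/null || return 0
        polls=$((wait_secs * 10))
        while [ "$polls" -gt 0 ]; do
            # kill -0 only probes for existence; note that an exited child
            # that its parent has not reaped yet still counts as present.
            kill -0 "$pid" 2>/dev/null || return 0
            sleep 0.1
            polls=$((polls - 1))
        done
    done
    return 1   # still present even after SIGKILL and the final wait
}

# Usage: kill_with_wait <qpidd_pid> 5
```

With a wait like this the broker normally gets time to finish its shutdown after the first signal, which is why the reproducer deliberately leaves it out.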

Comment 4 Alan Conway 2010-07-08 15:45:22 UTC
Fixed on trunk (r961814) and on the mrg_1.3.x branch:
http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=860119742313b551d3ee9bf116d398bc04129675

Comment 5 Frantisek Reznicek 2010-07-12 12:16:59 UTC
The RHEL4 x86_64 case is also reproducible using the MRG/Messaging/qpid_ptest_broker_cmdline_params test (still on -6).

Comment 6 Frantisek Reznicek 2010-07-13 06:42:54 UTC
The RHEL4 x86_64 case is quickly reproducible using the MRG/Messaging/qpid_ptest_broker_cmdline_params test, which does not stress the broker with 'furious kill' signals (the signals are sent a few seconds apart).
So the issue reproduces more readily than stated in the description.

