Description of problem:

The initial testing scenario was to test proper persistence of the queue / exchange and messages in standalone and clustered environments.

By mistake I forgot to pass the --wait-for-exit=N parameter to the mrg_kill_process_id() function, which is responsible for killing the process; that parameter makes it wait after a signal is sent to the process.

This mistake resulted in a 'furious qpidd shutdown', i.e. the following commands were issued in quick succession with a minimal time gap (approx. 0.05s):

kill -2 <qpidd_pid>
kill -15 <qpidd_pid>
kill -9 <qpidd_pid>

Unintentionally I was able to stress the qpidd shutdown process with multiple signals arriving within a short period.

I found the RHEL4 x86_64 broker aborting in:

#0  0x0000003f3002e32d in *__GI_raise (sig=Variable "sig" is not available.
) at ../nptl/sysdeps/unix/sysv/linux/raise.c:67
#1  0x0000003f3002fb2e in *__GI_abort () at ../sysdeps/generic/abort.c:88
#2  0x0000003f344b1148 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3  0x0000003f344af176 in __cxa_call_unexpected () from /usr/lib64/libstdc++.so.6
#4  0x0000003f344af1a3 in std::terminate () from /usr/lib64/libstdc++.so.6
#5  0x0000003f344af2a3 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x0000002a95aed3aa in qpid::sys::Thread::join () from /usr/lib64/libqpidcommon.so.3
#7  0x0000002a95bda139 in qpid::sys::Timer::stop () from /usr/lib64/libqpidcommon.so.3
#8  0x0000002a956cb672 in qpid::broker::Broker::~Broker$delete () from /usr/lib64/libqpidbroker.so.3
#9  0x0000002a957cebb0 in qpid::broker::SignalHandler::shutdownHandler () from /usr/lib64/libqpidbroker.so.3
#10 <signal handler called>

and the clustered RHEL5 i386 case in:

#0  0x001bc410 in __kernel_vsyscall ()
#1  0x00b6adf0 in raise () from /lib/libc.so.6
#2  0x00b6c701 in abort () from /lib/libc.so.6
#3  0x003878e3 in qpid::sys::assertClusterSafe() () from /usr/lib/libqpidcommon.so.3
#4  0x0097ae1c in qpid::broker::Queue::requeue(qpid::broker::QueuedMessage const&) () from /usr/lib/libqpidbroker.so.3
#5  0x0091df3f in qpid::broker::DeliveryRecord::requeue() const () from /usr/lib/libqpidbroker.so.3
#6  0x009b6440 in qpid::broker::SemanticState::recover(bool) () from /usr/lib/libqpidbroker.so.3
#7  0x009b693b in qpid::broker::SemanticState::closed() () from /usr/lib/libqpidbroker.so.3
#8  0x009d739e in qpid::broker::SessionState::~SessionState() () from /usr/lib/libqpidbroker.so.3
#9  0x009d1882 in qpid::broker::SessionHandler::~SessionHandler() () from /usr/lib/libqpidbroker.so.3
#10 0x009122f4 in qpid::broker::Connection::~Connection() () from /usr/lib/libqpidbroker.so.3
#11 0x00786d71 in qpid::cluster::Connection::~Connection() () from /usr/lib/qpid/daemon/cluster.so
#12 0x008f1385 in qpid::RefCounted::released() const () from /usr/lib/libqpidbroker.so.3
#13 0x0078d49a in qpid::cluster::ConnectionCodec::~ConnectionCodec() () from /usr/lib/qpid/daemon/cluster.so
#14 0x009a9d36 in qpid::broker::SecureConnection::~SecureConnection() () from /usr/lib/libqpidbroker.so.3
#15 0x00385e19 in qpid::sys::AsynchIOHandler::~AsynchIOHandler() () from /usr/lib/libqpidcommon.so.3

This kind of shutdown might not be valid, but it is worth looking at the paths above and revising the shutdown process to check for holes there.

Currently set to high/high for 1.3; feel free to modify based on your judgement.

Version-Release number of selected component (if applicable):
python-qpid-0.7.946106-4.el5
qpid-cpp-client-0.7.946106-6.el5
qpid-cpp-client-devel-0.7.946106-6.el5
qpid-cpp-client-devel-docs-0.7.946106-6.el5
qpid-cpp-client-ssl-0.7.946106-6.el5
qpid-cpp-mrg-debuginfo-0.7.946106-6.el5
qpid-cpp-server-0.7.946106-6.el5
qpid-cpp-server-cluster-0.7.946106-6.el5
qpid-cpp-server-devel-0.7.946106-6.el5
qpid-cpp-server-ssl-0.7.946106-6.el5
qpid-cpp-server-store-0.7.946106-6.el5
qpid-cpp-server-xml-0.7.946106-6.el5
qpid-java-client-0.7.946106-5.el5
qpid-java-common-0.7.946106-5.el5
qpid-tools-0.7.946106-6.el5

How reproducible:
hard (~5%)

Steps to Reproduce:
1. rhel4 and rhel5 standalone case:
   :>./run.loG ; while true; do ./run.sh || break; done | tee ./log.loG
   rhel5 clustered case:
   :>./run.loG ; while true; do ./run.sh $((2 + ${RANDOM}%6 )) || break; done | tee run.loG
2. wait for abort

Actual results:
The qpidd broker occasionally aborts during shutdown as a result of the 'furious kill'.

Expected results:
This is the question. I'm honestly not sure, but at least for all signals except SIGKILL proper handling is possible (though most probably extremely complex).

Additional info:
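For readers without the attachment, the 'furious kill' sequence from the description can be sketched in Python as follows. This is only an illustration, not the actual mrg_kill_process_id() implementation (which is not shown in this report); the function name furious_kill and the stand-in child process are hypothetical, and the 0.05s gap is taken from the description.

```python
import os
import signal
import subprocess
import sys
import time


def furious_kill(pid, gap=0.05):
    """Send SIGINT, SIGTERM and SIGKILL to pid, `gap` seconds apart.

    Hypothetical stand-in for the misconfigured kill helper: it does NOT
    wait for the process to react to one signal before sending the next.
    Returns the list of signals that were actually delivered.
    """
    sent = []
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            break  # the process is already gone and reaped
        sent.append(sig)
        time.sleep(gap)
    return sent


if __name__ == "__main__":
    # A sleeping child stands in for qpidd here (assumption for the demo).
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    delivered = furious_kill(child.pid)
    child.wait()
    print("signals sent:", [s.name for s in delivered])
    print("child return code:", child.returncode)
```

Against a real broker the pid would be that of qpidd, and the interesting outcome is whether a core file appears, not the return code.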
Created attachment 428775
The issue reproducer including abort backtraces
Created attachment 429995
The issue reproducer including abort backtraces

More info on the attachment:
comment #1 links two directories, qpidd_abort_01 and qpidd_abort_03, which are two instances of a 'similar' reproducer: one for the RHEL4 x86_64 case (qpidd_abort_03) and one for the RHEL5 i386 case (qpidd_abort_01).

I revisited the attached code and the symlinked code to make them more self-explanatory. I also found that the qpidd_abort_03 reproducer already contained the correction that waits for the program's reaction to each sent signal, so I reverted it.

The new attachment replaces the old one.
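For contrast, the correction mentioned above (waiting for the program's reaction before escalating) might look roughly like the sketch below. It assumes the test harness started the broker itself and holds a subprocess.Popen handle for it; graceful_kill and its wait_for_exit argument are hypothetical names that only mirror the --wait-for-exit=N idea, not the real helper.

```python
import signal
import subprocess
import sys


def graceful_kill(proc, wait_for_exit=5.0):
    """Escalate SIGINT -> SIGTERM -> SIGKILL, waiting after each signal.

    `proc` is a subprocess.Popen handle (assumption: the caller spawned
    the process). Returns the signal after which the process exited, or
    None if it was still running even after SIGKILL and the final wait.
    """
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        proc.send_signal(sig)
        try:
            # Give the process time to shut down before escalating.
            proc.wait(timeout=wait_for_exit)
            return sig
        except subprocess.TimeoutExpired:
            continue
    return None


if __name__ == "__main__":
    # A sleeping child stands in for qpidd here (assumption for the demo).
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    last = graceful_kill(child, wait_for_exit=2.0)
    print("exited after:", last.name if last else "still running")
```

With this variant the broker normally sees only one signal at a time, which is why it hides the aborts reported here.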
Fixed on trunk r961814
mrg_1.3.x branch:
http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=860119742313b551d3ee9bf116d398bc04129675
The RHEL4 x86_64 case is also reproducible using the MRG/Messaging/qpid_ptest_broker_cmdline_params test (still on -6).
The RHEL4 x86_64 case is rapidly reproducible using the MRG/Messaging/qpid_ptest_broker_cmdline_params test, which does not stress the broker with 'furious kill' signals (the signals are sent a few seconds apart).

So the issue is reproducible more readily than specified in the description.