Description of problem: There are observed quite frequent crashes of qpidd+msgstore in mrg::msgstore::InactivityFireEvent::~InactivityFireEvent() call, precisely: Thread 1 (Thread 29596): #0 0x004166f1 in _int_free () from /lib/libc.so.6 #1 0x004169e9 in free () from /lib/libc.so.6 #2 0x00bcd581 in operator delete(void*) () from /usr/lib/libstdc++.so.6 #3 0x005f55b3 in mrg::msgstore::InactivityFireEvent::~InactivityFireEvent ( this=0x4f0140, __in_chrg=<value optimized out>) at JournalImpl.h:55 #4 0x05867bd5 in qpid::RefCounted::released (this=0x8a0d090) at qpid/RefCounted.h:48 #5 0x005f2c3a in release (this=0x8a0c898, __in_chrg=<value optimized out>) at /usr/src/debug/qpid-cpp-mrg-0.7.946106/cpp/src/qpid/RefCounted.h:42 #6 intrusive_ptr_release (this=0x8a0c898, __in_chrg=<value optimized out>) at /usr/src/debug/qpid-cpp-mrg-0.7.946106/cpp/src/qpid/RefCounted.h:57 #7 ~intrusive_ptr (this=0x8a0c898, __in_chrg=<value optimized out>) at /usr/include/boost/intrusive_ptr.hpp:83 #8 mrg::msgstore::JournalImpl::~JournalImpl (this=0x8a0c898, __in_chrg=<value optimized out>) at JournalImpl.cpp:128 ... The crash was seen in the automated test which was stressing msgstore module by restarting broker with different msgstore arguments/parameters and message flow in / out after restarts. Similar crash was observed another situation (B]) when broker was joining cluster of one node and the only node was under heavy stress. The issue was seen on RHEL 5.5 i386. Version-Release number of selected component (if applicable): condor-qmf-7.4.3-0.14.el5 openais-0.80.6-16.el5_5.1 openais-debuginfo-0.80.6-16.el5_5.1 openais-devel-0.80.6-16.el5_5.1 python-qmf-0.7.946106-3.el5 python-qpid-0.7.946106-1.el5 qmf-0.7.946106-1.el5 qmf-devel-0.7.946106-1.el5 qpid-cpp-client-0.7.946106-1.el5 qpid-cpp-client-devel-0.7.946106-1.el5 qpid-cpp-client-devel-docs-0.7.946106-1.el5 qpid-cpp-client-ssl-0.7.946106-1.el5 qpid-cpp-server-0.7.946106-1.el5 qpid-cpp-server-cluster-0.7.946106-1.el5 qpid-cpp-server-devel-0.7.946106-1.el5 qpid-cpp-server-ssl-0.7.946106-1.el5 qpid-cpp-server-store-0.7.946106-1.el5 qpid-cpp-server-xml-0.7.946106-1.el5 qpid-java-client-0.7.946106-1.el5 qpid-java-common-0.7.946106-1.el5 qpid-tools-0.7.946106-2.el5 How reproducible: quite rapidly (>80%) Steps to Reproduce: A] ---------------------- 1. run MRG/Messaging/qpid_ptest_restart_with_msgstore_sweep 2. wait for crashes B] ---------------------- (alternative way) 1. follow https://bugzilla.redhat.com/show_bug.cgi?id=587190#c4 steps to run the test on one machine 2. set-up second machine with same openais config and run broker from the second machine to join the same cluster while test on the first machine is running for instance by: qpidd --cluster-name X --log-enable=info+ --data-dir=xx --truncate=yes Actual results: Broker crashes. Expected results: Broker should not crash. Additional info: case A] ./core.26426: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'qpidd' GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5) Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu". For bug reporting instructions, please see: warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug warning: .dynamic section for "/lib/libdl.so.2" is not at the expected address warning: difference appears to be caused by prelink, adjusting expectations warning: .dynamic section for "/lib/librt.so.1" is not at the expected address warning: difference appears to be caused by prelink, adjusting expectations warning: .dynamic section for "/lib/libm.so.6" is not at the expected address warning: difference appears to be caused by prelink, adjusting expectations warning: .dynamic section for "/lib/libc.so.6" is not at the expected address warning: difference appears to be caused by prelink, adjusting expectations warning: .dynamic section for "/lib/libpthread.so.0" is not at the expected address warning: difference appears to be caused by prelink, adjusting expectations warning: .dynamic section for "/usr/lib/openais/libcpg.so.2" is not at the expected address warning: difference appears to be caused by prelink, adjusting expectations warning: .dynamic section for "/usr/lib/libxqilla.so.3" is not at the expected address warning: difference appears to be caused by prelink, adjusting expectations warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug Core was generated by `/usr/sbin/qpidd --data-dir /mnt/tests/distribution/MRG/Messaging/qpid_ptest_res'. Program terminated with signal 11, Segmentation fault. #0 0x002366f1 in _int_free () from /lib/libc.so.6 (gdb) eax 0x19960112 429261074 ecx 0x995c080 160809088 edx 0x995c080 160809088 ebx 0x30eff4 3207156 esp 0xbfa7250c 0xbfa7250c ebp 0xbfa72590 0xbfa72590 esi 0x995c018 160808984 edi 0x68 104 eip 0x2366f1 0x2366f1 <_int_free+2081> eflags 0x10206 [ PF IF RF ] cs 0x73 115 ss 0x7b 123 ds 0x7b 123 es 0xc040007b -1069547397 fs 0x0 0 gs 0x33 51 (gdb) Using memory regions provided by the target. There are no memory regions defined. (gdb) 32 AT_SYSINFO Special system info/entry points 0x1cc400 33 AT_SYSINFO_EHDR System-supplied DSO's ELF header 0x1cc000 16 AT_HWCAP Machine-dependent CPU capability hints 0xbfebfbff 6 AT_PAGESZ System page size 4096 17 AT_CLKTCK Frequency of times() 100 3 AT_PHDR Program headers for program 0x8048034 4 AT_PHENT Size of program header entry 32 5 AT_PHNUM Number of program headers 8 7 AT_BASE Base address of interpreter 0x0 8 AT_FLAGS Flags 0x0 9 AT_ENTRY Entry point of program 0x804c160 11 AT_UID Real user ID 0 12 AT_EUID Effective user ID 0 13 AT_GID Real group ID 0 14 AT_EGID Effective group ID 0 23 AT_SECURE Boolean, was exec setuid-like? 0 15 AT_PLATFORM String identifying platform 0xbfa72ccb "i686" 0 AT_NULL End of vector 0x0 (gdb) Stack level 0, frame at 0xbfa72598: eip = 0x2366f1 in _int_free; saved eip 0x2369e9 called by frame at 0xbfa725d0 Arglist at 0xbfa72590, args: Locals at 0xbfa72590, Previous frame's sp is 0xbfa72598 Saved registers: ebp at 0xbfa72590, esi at 0xbfa72588, edi at 0xbfa7258c, eip at 0xbfa72594 (gdb) From To Syms Read Shared Object Library 0x057e6c00 0x0599dbc4 Yes (*) /usr/lib/libqpidbroker.so.2 0x00480d70 0x005a0314 Yes (*) /usr/lib/libqpidcommon.so.2 0x006e2970 0x00701524 Yes (*) /usr/lib/libboost_program_options.so.2 0x006c69a0 0x006cf614 Yes (*) /usr/lib/libboost_filesystem.so.2 0x060e1ee0 0x060e3954 Yes (*) /lib/libuuid.so.1 0x00110a50 0x00111a84 Yes (*) /lib/libdl.so.2 0x00115880 0x00119b04 Yes (*) /lib/librt.so.1 0x0614c190 0x0615c774 Yes (*) /usr/lib/libsasl2.so.2 0x00b59c50 0x00bd5134 Yes (*) /usr/lib/libstdc++.so.6 0x00120410 0x0013b594 Yes (*) /lib/libm.so.6 0x008db660 0x008e2f34 Yes (*) /lib/libgcc_s.so.1 0x001e2c80 0x002ddc40 Yes (*) /lib/libc.so.6 0x0038f7f0 0x003a4ebf Yes (*) /lib/ld-linux.so.2 0x00148210 0x00153a74 Yes (*) /lib/libpthread.so.0 0x007740c0 0x0077f054 Yes (*) /lib/libresolv.so.2 0x00dcc6e0 0x00dd3314 Yes (*) /lib/libcrypt.so.1 0x00eed240 0x00ef5c84 Yes (*) /usr/lib/qpid/daemon/ssl.so 0x00f14970 0x00f2dd94 Yes (*) /usr/lib/libsslcommon.so.2 0x05fbed30 0x060a0304 Yes (*) /usr/lib/libnss3.so 0x060f0f50 0x06114824 Yes (*) /usr/lib/libssl3.so 0x05f56750 0x05f78c24 Yes (*) /usr/lib/libnspr4.so 0x05f8d980 0x05f98a54 Yes (*) /usr/lib/libnssutil3.so 0x05fa8df0 0x05faa704 Yes (*) /usr/lib/libplc4.so 0x05fa3a30 0x05fa49a4 Yes (*) /usr/lib/libplds4.so 0x00a1acf0 0x00abae34 Yes (*) /usr/lib/qpid/daemon/msgstore.so 0x007a3240 0x0086b554 Yes (*) /usr/lib/libdb_cxx-4.3.so 0x00f72390 0x00f726d0 Yes (*) /usr/lib/libaio.so.1 0x006a9240 0x006ae724 Yes (*) /usr/lib/qpid/daemon/replication_exchange.so 0x001610f0 0x001674c4 Yes (*) /usr/lib/qpid/daemon/replicating_listener.so 0x00c9d700 0x00d12e84 Yes (*) /usr/lib/qpid/daemon/cluster.so 0x0016be80 0x0016e024 Yes (*) /usr/lib/openais/libcpg.so.2 0x00655d30 0x00658894 Yes (*) /usr/lib/libcman.so.2 0x0091a5a0 0x0099fc34 Yes (*) /usr/lib/libqpidclient.so.2 0x00173f20 0x00177b64 Yes (*) /usr/lib/qpid/daemon/watchdog.so 0x00180a20 0x0018b854 Yes (*) /usr/lib/qpid/daemon/xml.so 0x05c720a0 0x05e2f534 Yes (*) /usr/lib/libxerces-c.so.28 0x01082c00 0x012381d4 Yes (*) /usr/lib/libxqilla.so.3 0x0019aae0 0x001bd9f4 Yes (*) /usr/lib/qpid/daemon/acl.so (*): Shared library is missing debugging information. (gdb) * 1 Thread 26426 0x002366f1 in _int_free () from /lib/libc.so.6 Thread 1 (Thread 26426): #0 0x002366f1 in _int_free () from /lib/libc.so.6 #1 0x002369e9 in free () from /lib/libc.so.6 #2 0x00bcd581 in operator delete(void*) () from /usr/lib/libstdc++.so.6 #3 0x00a275b3 in mrg::msgstore::InactivityFireEvent::~InactivityFireEvent ( this=0x310140, __in_chrg=<value optimized out>) at JournalImpl.h:55 #4 0x05867bd5 in qpid::RefCounted::released (this=0x995c080) at qpid/RefCounted.h:48 #5 0x00a24c3a in release (this=0x995b888, __in_chrg=<value optimized out>) at /usr/src/debug/qpid-cpp-mrg-0.7.946106/cpp/src/qpid/RefCounted.h:42 #6 intrusive_ptr_release (this=0x995b888, __in_chrg=<value optimized out>) at /usr/src/debug/qpid-cpp-mrg-0.7.946106/cpp/src/qpid/RefCounted.h:57 #7 ~intrusive_ptr (this=0x995b888, __in_chrg=<value optimized out>) at /usr/include/boost/intrusive_ptr.hpp:83 #8 mrg::msgstore::JournalImpl::~JournalImpl (this=0x995b888, __in_chrg=<value optimized out>) at JournalImpl.cpp:128 #9 0x00a4c6ec in mrg::msgstore::TplJournalImpl::~TplJournalImpl ( this=0x995b888, __in_chrg=<value optimized out>) at JournalImpl.h:243 #10 0x00a4d3a8 in checked_delete<mrg::msgstore::TplJournalImpl> ( this=0x99497d8) at /usr/include/boost/checked_delete.hpp:34 #11 boost::detail::sp_counted_impl_p<mrg::msgstore::TplJournalImpl>::dispose ( this=0x99497d8) at /usr/include/boost/detail/sp_counted_impl.hpp:76 #12 0x00a35b58 in release (this=0x99485f0, __in_chrg=<value optimized out>) at /usr/include/boost/detail/sp_counted_base_gcc_x86.hpp:145 #13 ~shared_count (this=0x99485f0, __in_chrg=<value optimized out>) at /usr/include/boost/detail/shared_count.hpp:159 #14 ~shared_ptr (this=0x99485f0, __in_chrg=<value optimized out>) at /usr/include/boost/shared_ptr.hpp:106 #15 mrg::msgstore::MessageStoreImpl::~MessageStoreImpl (this=0x99485f0, __in_chrg=<value optimized out>) at MessageStoreImpl.cpp:446 #16 0x00a1b428 in checked_delete<mrg::msgstore::MessageStoreImpl> ( this=0x9948750) at /usr/include/boost/checked_delete.hpp:34 #17 boost::detail::sp_counted_impl_p<mrg::msgstore::MessageStoreImpl>::dispose (this=0x9948750) at /usr/include/boost/detail/sp_counted_impl.hpp:76 #18 0x00a1b0e8 in ~shared_count () at /usr/include/boost/detail/sp_counted_base_gcc_x86.hpp:145 #19 ~shared_ptr () at /usr/include/boost/shared_ptr.hpp:106 #20 ~StorePlugin () at StorePlugin.cpp:36 #21 __tcf_1 () at StorePlugin.cpp:77 #22 0x001f8da9 in exit () from /lib/libc.so.6 #23 0x001e2ea4 in __libc_start_main () from /lib/libc.so.6 #24 0x0804c181 in _start () (gdb) quit All crashes here: https://beaker.engineering.redhat.com/logs/2010/33/833/1555/11309/45373///test_log--distribution-MRG-Messaging-qpid_ptest_restart_with_msgstore_sweep.log case B] [root@mrg-qe-01 bz587190]# ulimit -c unlimited [root@mrg-qe-01 bz587190]# qpidd --cluster-name X --log-enable=info+ --data-dir=xx --truncate=yes 2010-05-24 09:51:48 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so 2010-05-24 09:51:48 info Loaded Module: /usr/lib/qpid/daemon/replicating_listener.so 2010-05-24 09:51:48 info Loaded Module: /usr/lib/qpid/daemon/ssl.so 2010-05-24 09:51:48 info Loaded Module: /usr/lib/qpid/daemon/cluster.so 2010-05-24 09:51:48 info Loaded Module: /usr/lib/qpid/daemon/watchdog.so 2010-05-24 09:51:48 info Loaded Module: /usr/lib/qpid/daemon/replication_exchange.so 2010-05-24 09:51:48 info Loaded Module: /usr/lib/qpid/daemon/xml.so 2010-05-24 09:51:48 info Loaded Module: /usr/lib/qpid/daemon/acl.so 2010-05-24 09:51:48 info No message store configured, persistence is disabled. 2010-05-24 09:51:48 info Management enabled 2010-05-24 09:51:48 notice Store directory xx/rhm was truncated. 2010-05-24 09:51:48 notice Journal "TplStore": Created 2010-05-24 09:51:48 notice Store module initialized; store-dir=xx 2010-05-24 09:51:48 info > Default files per journal: 8 2010-05-24 09:51:48 info > Default journal file size: 24 (wpgs) 2010-05-24 09:51:48 info > Default write cache page size: 32 (Kib) 2010-05-24 09:51:48 info > Default number of write cache pages: 32 2010-05-24 09:51:48 info > TPL files per journal: 8 2010-05-24 09:51:48 info > TPL journal file size: 24 (wpgs) 2010-05-24 09:51:48 info > TPL write cache page size: 4 (Kib) 2010-05-24 09:51:48 info > TPL number of write cache pages: 64 2010-05-24 09:51:48 info Registered replication exchange 2010-05-24 09:51:48 info Registered xml exchange 2010-05-24 09:51:48 notice Initializing CPG 2010-05-24 09:51:48 notice Cluster store state: clean cluster-id=efc84c79-87a5-4c5c-8e13-8c2d0bf1bf37 shutdown-id=b5a1a179-e484-4915-a572-7bf7ae3e8e58 2010-05-24 09:51:48 notice cluster(10.34.45.1:4784 PRE_INIT) Membership update: 10.34.45.1:4784 10.34.45.2:30499 2010-05-24 09:51:48 notice cluster(10.34.45.1:4784 PRE_INIT) Members joined: 10.34.45.1:4784 2010-05-24 09:51:48 notice Cluster recovery: recovered journal data discarded and journal files pushed down 2010-05-24 09:51:48 notice Store directory xx/rhm was pushed down into directory xx/_cluster.bak.0003. 2010-05-24 09:51:49 notice Journal "TplStore": Created 2010-05-24 09:51:49 notice SASL disabled: No Authentication Performed 2010-05-24 09:51:49 notice Listening on TCP port 5672 2010-05-24 09:51:49 info SSL plugin not enabled, you must set --ssl-cert-db to enable it. 2010-05-24 09:51:49 info Policy file not specified. ACL Disabled, no ACL checking being done! 2010-05-24 09:51:49 critical Unexpected error: Cluster-ID mismatch. Stores belong to different clusters. Segmentation fault (core dumped) [root@mrg-qe-01 bz587190]# file core.4784 core.4784: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'qpidd' [root@mrg-qe-01 bz587190]# gdb `which qpidd` core.4784 GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5) Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/qpidd...(no debugging symbols found)...done. ... Loaded symbols for /usr/lib/qpid/daemon/acl.so Core was generated by `qpidd --cluster-name X --log-enable=info+ --data-dir=xx --truncate=yes'. Program terminated with signal 11, Segmentation fault. #0 0x002906f1 in _int_free () from /lib/libc.so.6 (gdb) info threads * 1 Thread 4784 0x002906f1 in _int_free () from /lib/libc.so.6 (gdb) bt #0 0x002906f1 in _int_free () from /lib/libc.so.6 #1 0x002909e9 in free () from /lib/libc.so.6 #2 0x0724e581 in operator delete(void*) () from /usr/lib/libstdc++.so.6 #3 0x001475b3 in mrg::msgstore::InactivityFireEvent::~InactivityFireEvent() () from /usr/lib/qpid/daemon/msgstore.so #4 0x004d8bd5 in qpid::RefCounted::released() const () from /usr/lib/libqpidbroker.so.2 #5 0x00144c3a in mrg::msgstore::JournalImpl::~JournalImpl() () from /usr/lib/qpid/daemon/msgstore.so #6 0x0016c6ec in mrg::msgstore::TplJournalImpl::~TplJournalImpl() () from /usr/lib/qpid/daemon/msgstore.so #7 0x0016d3a8 in boost::detail::sp_counted_impl_p<mrg::msgstore::TplJournalImpl>::dispose() () from /usr/lib/qpid/daemon/msgstore.so #8 0x00155b58 in mrg::msgstore::MessageStoreImpl::~MessageStoreImpl() () from /usr/lib/qpid/daemon/msgstore.so #9 0x0013b428 in boost::detail::sp_counted_impl_p<mrg::msgstore::MessageStoreImpl>::dispose() () from /usr/lib/qpid/daemon/msgstore.so #10 0x0013b0e8 in mrg::journal::lpmgr::finalize () from /usr/lib/qpid/daemon/msgstore.so #11 0x00252da9 in exit () from /lib/libc.so.6 #12 0x0023cea4 in __libc_start_main () from /lib/libc.so.6 #13 0x0804c181 in std::ios_base::Init::~Init() () (gdb) quit
I can reproduce this error using a single machine and starting two nodes on it. However, the error only occurs when the cluster module fails. If clustering starts correctly/normally, then this is not seen. From comment #1 above: 2010-05-24 09:51:49 critical Unexpected error: Cluster-ID mismatch. Stores belong to different clusters. Segmentation fault (core dumped) From my own testing: 2010-05-26 15:58:46 critical Unexpected error: Cannot recover, no clean store. Segmentation fault (core dumped)
I have found cases where using ctrl-c to stop a broker can also cause this crash, even if the clustering starts normally.
Using valgrind on a scenario that produces this bug shows that there may be a broker shutdown ordering problem at play here. See bug 596765 for details.
If the above scenario is repeated with no management enabled (ie use the "-m no" option when starting the broker, the core does not occur. This increases the probability that bug 596765 is the cause of this.
I have confirmed that with the fix for bug 596765 in place, this problem is no longer seen.
The reproducer is the same as for bug 596765 so I conseder it being verified.