Bug 614944

Summary: qpidd broker crash in mrg::msgstore::TxnCtxt::abort() -> DbTxn::abort()
Product: Red Hat Enterprise MRG Reporter: Frantisek Reznicek <freznice>
Component: qpid-cppAssignee: Kim van der Riet <kim.vdriet>
Status: CLOSED ERRATA QA Contact: Petr Matousek <pematous>
Severity: high Docs Contact:
Priority: high    
Version: DevelopmentCC: esammons, gsim, jross, kim.vdriet, pematous
Target Milestone: 2.0   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: qpid-cpp-mrg-0.9.1079953 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-06-23 15:45:35 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Frantisek Reznicek 2010-07-15 15:37:51 UTC
Description of problem:

While stressing qpidd message store according bug 456272 on a 32 core system I found qpidd crashing in this path:

  #0  0x00002ba97da13440 in DbTxn::abort() () from /usr/lib64/libdb_cxx-4.3.so
  #1  0x00002ba97d74e01f in mrg::msgstore::TxnCtxt::abort (this=0x7fffa26912f0)
      at TxnCtxt.cpp:131
  #2  0x00002ba97d74ebb5 in mrg::msgstore::TxnCtxt::~TxnCtxt (this=0x1d2e9440, 
      __in_chrg=<value optimized out>) at TxnCtxt.cpp:77
  #3  0x00002ba97d72cc54 in mrg::msgstore::MessageStoreImpl::init (
      this=0x1d2e9860) at MessageStoreImpl.cpp:359

The client accesses message store (create queues) in parallel in multiple threads, see client below linked.

There are were seen 2 crashes (rhel5.5 i386) in hundreds of runs, so most probably it is more likely happening on machines like this:
processor	: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
vendor_id	: GenuineIntel
model name	: Genuine Intel(R) CPU             0000 @ 1.73GHz
cpu cores	: 8
cpu MHz		: 1064.000
bogomips	: 3466.75 3466.73 3466.73 3466.58 3466.63 3466.68 3466.73 3466.73 3466.67 3466.71 3466.66 3466.82 3466.82 3466.76 3466.71 3466.67 3466.67 3466.65 3466.80 3466.80 3466.65 3466.65 3466.63 3466.84 3466.80 3466.82 3466.71 3466.66 3466.77 3466.70 3466.81 3466.71
             total       used       free     shared    buffers     cached
Mem:      16404480     873048   15531432          0     113264     451696
-/+ buffers/cache:     308088   16096392
Swap:     14942200          0   14942200


Version-Release number of selected component (if applicable):
> python-qmf-0.7.946106-5.el5
> python-qpid-0.7.946106-4.el5
> python-saslwrapper-0.1.934605-2.el5
> qmf-0.7.946106-6.el5
> qmf-devel-0.7.946106-6.el5
> qpid-cpp-client-0.7.946106-6.el5
> qpid-cpp-client-devel-0.7.946106-6.el5
> qpid-cpp-client-devel-docs-0.7.946106-6.el5
> qpid-cpp-client-rdma-0.7.946106-6.el5
> qpid-cpp-client-ssl-0.7.946106-6.el5
> qpid-cpp-mrg-debuginfo-0.7.946106-6.el5
> qpid-cpp-server-0.7.946106-6.el5
> qpid-cpp-server-cluster-0.7.946106-6.el5
> qpid-cpp-server-devel-0.7.946106-6.el5
> qpid-cpp-server-rdma-0.7.946106-6.el5
> qpid-cpp-server-ssl-0.7.946106-6.el5
> qpid-cpp-server-store-0.7.946106-6.el5
> qpid-cpp-server-xml-0.7.946106-6.el5
> qpid-dotnet-0.4.738274-2.el5
> qpid-java-client-0.7.946106-5.el5
> qpid-java-common-0.7.946106-5.el5
> qpid-tests-0.7.946106-1.el5
> qpid-tools-0.7.946106-6.el5


How reproducible:
very hard ~ 1%

Steps to Reproduce:
1. start the broker
2. run background find to load the system (find..... &)
3. run client bz456272.py on the broker in the loop
 see http://cvs.devel.redhat.com/cgi-bin/cvsweb.cgi/tests/distribution/MRG/Messaging/qpid_broker_jfail_bz456272/

Actual results:
qpidd crashes.

Expected results:
qpidd should not crash.

Additional info:

[08:49:39] Core file: ./core.25515 generated by /usr/sbin/qpidd ----------------------1/2-
./core.25515: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'qpidd'
  GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
  Copyright (C) 2009 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
  and "show warranty" for details.
  This GDB was configured as "x86_64-redhat-linux-gnu".
  For bug reporting instructions, please see:
  warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug
  [New Thread 25526]
  warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug
  Core was generated by `/usr/sbin/qpidd --log-enable info+ --auth no --tpl-wcache-page-size 128 --tpl-j'.
  Program terminated with signal 11, Segmentation fault.
  #0  0x00002ba8243e9440 in DbTxn::abort() () from /usr/lib64/libdb_cxx-4.3.so
  (gdb) rax            0x0	0
  rbx            0x9234440	153306176
  rcx            0x92347e0	153307104
  rdx            0x0	0
  rsi            0x1	1
  rdi            0x9234440	153306176
  rbp            0x7fff65c39350	0x7fff65c39350
  rsp            0x7fff65c38e50	0x7fff65c38e50
  r8             0x0	0
  r9             0x2ba824102af9	48001159539449
  r10            0x2ba8240b1000	48001159204864
  r11            0x3c1e2bced0	258204225232
  r12            0x9234860	153307232
  r13            0x2ba82439aa40	48001162259008
  r14            0x9230028	153288744
  r15            0x9230088	153288840
  rip            0x2ba8243e9440	0x2ba8243e9440 <DbTxn::abort()+32>
  eflags         0x10202	[ IF RF ]
  cs             0x33	51
  ss             0x2b	43
  ds             0x0	0
  es             0x0	0
  fs             0x0	0
  gs             0x0	0
  fctrl          0x37f	895
  fstat          0x0	0
  ftag           0xffff	65535
  fiseg          0x0	0
  fioff          0x0	0
  foseg          0x0	0
  fooff          0x0	0
  fop            0x0	0
  mxcsr          0x1f80	[ IM DM ZM OM UM PM ]
  (gdb) Using memory regions provided by the target.
  There are no memory regions defined.
  (gdb) 16   AT_HWCAP             Machine-dependent CPU capability hints 0xbfebfbff
  6    AT_PAGESZ            System page size               4096
  17   AT_CLKTCK            Frequency of times()           100
  3    AT_PHDR              Program headers for program    0x400040
  4    AT_PHENT             Size of program header entry   56
  5    AT_PHNUM             Number of program headers      8
  7    AT_BASE              Base address of interpreter    0x0
  8    AT_FLAGS             Flags                          0x0
  9    AT_ENTRY             Entry point of program         0x404ff0
  11   AT_UID               Real user ID                   0
  12   AT_EUID              Effective user ID              0
  13   AT_GID               Real group ID                  0
  14   AT_EGID              Effective group ID             0
  23   AT_SECURE            Boolean, was exec setuid-like? 0
  15   AT_PLATFORM          String identifying platform    0x7fff65c3c029 "x86_64"
  0    AT_NULL              End of vector                  0x0
  (gdb) Stack level 0, frame at 0x7fff65c38e70:
   rip = 0x2ba8243e9440 in DbTxn::abort(); saved rip 0x2ba82412401f
   called by frame at 0x7fff65c38e90
   Arglist at 0x7fff65c38e48, args: 
   Locals at 0x7fff65c38e48, Previous frame's sp is 0x7fff65c38e70
   Saved registers:
    rbx at 0x7fff65c38e50, rbp at 0x7fff65c38e58, r12 at 0x7fff65c38e60,
    rip at 0x7fff65c38e68
  (gdb) From                To                  Syms Read   Shared Object Library
  0x0000003c0dea14b0  0x0000003c0e022a78  Yes (*)     /usr/lib64/libqpidbroker.so.3
  0x0000003c0d50dd60  0x0000003c0d60f6c8  Yes (*)     /usr/lib64/libqpidcommon.so.3
  0x0000003c0cc10aa0  0x0000003c0cc2dae8  Yes (*)     /usr/lib64/libboost_program_options.so.2
  0x0000003c0da04810  0x0000003c0da0cff8  Yes (*)     /usr/lib64/libboost_filesystem.so.2
  0x0000003c20601500  0x0000003c20602918  Yes (*)     /lib64/libuuid.so.1
  0x0000003c0c400e10  0x0000003c0c401a08  Yes (*)     /lib64/libdl.so.2
  0x0000003c0d002220  0x0000003c0d005cc8  Yes (*)     /lib64/librt.so.1
  0x0000003c20a046e0  0x0000003c20a13be8  Yes (*)     /usr/lib64/libsasl2.so.2
  0x0000003c1e24f430  0x0000003c1e2c3058  Yes (*)     /usr/lib64/libstdc++.so.6
  0x0000003c0c003e60  0x0000003c0c043e38  Yes (*)     /lib64/libm.so.6
  0x0000003c19a01e50  0x0000003c19a0b018  Yes (*)     /lib64/libgcc_s.so.1
  0x0000003c0bc1d780  0x0000003c0bd09ff8  Yes (*)     /lib64/libc.so.6
  0x0000003c0b800a70  0x0000003c0b81671e  Yes (*)     /lib64/ld-linux-x86-64.so.2
  0x0000003c0c8051f0  0x0000003c0c810258  Yes (*)     /lib64/libpthread.so.0
  0x0000003c110032a0  0x0000003c1100e2d8  Yes (*)     /lib64/libresolv.so.2
  0x0000003c1e6009f0  0x0000003c1e606918  Yes (*)     /lib64/libcrypt.so.1
  0x00002ba8213605e0  0x00002ba821363b68  Yes (*)     /usr/lib64/qpid/daemon/watchdog.so
  0x00002ba82156e7b0  0x00002ba821578568  Yes (*)     /usr/lib64/qpid/daemon/xml.so
  0x00002ba821904070  0x00002ba821a85758  Yes (*)     /usr/lib64/libxerces-c.so.28
  0x00002ba821f1a090  0x00002ba8220a5b28  Yes (*)     /usr/lib64/libxqilla.so.3
  0x00002ba8223ebf10  0x00002ba82240c4e8  Yes (*)     /usr/lib64/qpid/daemon/acl.so
  0x00002ba82261f570  0x00002ba822627728  Yes (*)     /usr/lib64/qpid/daemon/ssl.so
  0x00002ba82284d640  0x00002ba8228640a8  Yes (*)     /usr/lib64/libsslcommon.so.3
  0x0000003c1ea18080  0x0000003c1eaf4968  Yes (*)     /usr/lib64/libnss3.so
  0x0000003c1fe07ee0  0x0000003c1fe28ae8  Yes (*)     /usr/lib64/libssl3.so
  0x0000003c1fa0ced0  0x0000003c1fa2b018  Yes (*)     /usr/lib64/libnspr4.so
  0x0000003c1f2080d0  0x0000003c1f212428  Yes (*)     /usr/lib64/libnssutil3.so
  0x0000003c1ee01370  0x0000003c1ee02978  Yes (*)     /usr/lib64/libplc4.so
  0x0000003c1f600e10  0x0000003c1f601c08  Yes (*)     /usr/lib64/libplds4.so
  0x00002ba822a73b00  0x00002ba822a7b5a8  Yes (*)     /usr/lib64/qpid/daemon/rdma.so
  0x00002ba822c9f270  0x00002ba822cac928  Yes (*)     /usr/lib64/librdmawrap.so.3
  0x00002ba822eb5360  0x00002ba822ebbe38  Yes (*)     /usr/lib64/libibverbs.so.1
  0x00002ba8230c0580  0x00002ba8230c2f88  Yes (*)     /usr/lib64/librdmacm.so.1
  0x00002ba8232cab70  0x00002ba8232d0648  Yes (*)     /usr/lib64/qpid/daemon/replicating_listener.so
  0x00002ba823515330  0x00002ba82357ece8  Yes (*)     /usr/lib64/qpid/daemon/cluster.so
  0x00002ba8237ba3d0  0x00002ba8237bc338  Yes (*)     /usr/lib64/openais/libcpg.so.2
  0x00002ba8239be110  0x00002ba8239c0b78  Yes (*)     /usr/lib64/libcman.so.2
  0x00002ba823c01630  0x00002ba823c7a1d8  Yes (*)     /usr/lib64/libqpidclient.so.3
  0x00002ba823ea9610  0x00002ba823eadbd8  Yes (*)     /usr/lib64/qpid/daemon/replication_exchange.so
  0x00002ba8240e8e00  0x00002ba8241765c8  Yes (*)     /usr/lib64/qpid/daemon/msgstore.so
  0x00002ba8243df5d0  0x00002ba824498288  Yes (*)     /usr/lib64/libdb_cxx-4.3.so
  0x00002ba8246c3510  0x00002ba8246c36d1  Yes (*)     /usr/lib64/libaio.so.1
  (*): Shared library is missing debugging information.
  (gdb)   2 Thread 25526  0x0000003c0c80b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
     from /lib64/libpthread.so.0
  * 1 Thread 25515  0x00002ba8243e9440 in DbTxn::abort() ()
     from /usr/lib64/libdb_cxx-4.3.so
  Thread 2 (Thread 25526):
  #0  0x0000003c0c80b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
     from /lib64/libpthread.so.0
  #1  0x0000003c0d5fafdf in qpid::sys::Timer::run (this=0x921aa20)
      at ../include/qpid/sys/posix/Condition.h:69
  #2  0x0000003c0d52286a in qpid::sys::(anonymous namespace)::runRunnable (
      p=0x921aa54) at qpid/sys/posix/Thread.cpp:35
  #3  0x0000003c0c80673d in start_thread () from /lib64/libpthread.so.0
  #4  0x0000003c0bcd3d1d in clone () from /lib64/libc.so.6
  Thread 1 (Thread 25515):
  #0  0x00002ba8243e9440 in DbTxn::abort() () from /usr/lib64/libdb_cxx-4.3.so
  #1  0x00002ba82412401f in mrg::msgstore::TxnCtxt::abort (this=0x7fff65c39350)
      at TxnCtxt.cpp:131
  #2  0x00002ba824124bb5 in mrg::msgstore::TxnCtxt::~TxnCtxt (this=0x9234440, 
      __in_chrg=<value optimized out>) at TxnCtxt.cpp:77
  #3  0x00002ba824102c54 in mrg::msgstore::MessageStoreImpl::init (
      this=0x9234860) at MessageStoreImpl.cpp:359
  #4  0x00002ba824108459 in mrg::msgstore::MessageStoreImpl::init (
      this=0x9230000, dir=..., jfiles=16, jfileSizePgs=32, truncateFlag=false, 
      wCachePageSizeKib=32, tplJfiles=8, tplJfileSizePgs=32, 
      tplWCachePageSizeKib=128, autoJExpand=false, autoJExpandMaxFiles=0)
      at MessageStoreImpl.cpp:289
  #5  0x00002ba824109098 in mrg::msgstore::MessageStoreImpl::init (
      this=0x9230000, options=<value optimized out>) at MessageStoreImpl.cpp:256
  #6  0x00002ba8240e9485 in qpid::broker::StorePlugin::earlyInitialize (
      this=0x2ba82439ece0, target=...) at StorePlugin.cpp:56
  #7  0x0000003c0d5aff52 in for_each<__gnu_cxx::__normal_iterator<qpid::Plugin* const*, std::vector<qpid::Plugin*, std::allocator<qpid::Plugin*> > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > > (t=<value optimized out>)
      at /usr/include/boost/bind/mem_fn_template.hpp:149
  #8  each_plugin<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > > (t=<value optimized out>) at qpid/Plugin.cpp:73
  #9  qpid::Plugin::earlyInitAll (t=<value optimized out>) at qpid/Plugin.cpp:87
  #10 0x0000003c0df14b89 in qpid::broker::Broker::Broker (this=0x921a9d0, 
      conf=...) at qpid/broker/Broker.cpp:214
  #11 0x0000000000406a50 in QpiddBroker::execute (this=<value optimized out>, 
      options=0x9213180) at posix/QpiddBroker.cpp:171
  #12 0x00000000004055af in main (argc=17, argv=0x7fff65c3bc48) at qpidd.cpp:80
  (gdb) quit

[08:49:43] Core file: ./core.25612 generated by /usr/sbin/qpidd ----------------------2/2-
./core.25612: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'qpidd'
  GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
  Copyright (C) 2009 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
  and "show warranty" for details.
  This GDB was configured as "x86_64-redhat-linux-gnu".
  For bug reporting instructions, please see:
  warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug
  [New Thread 25614]
  warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .gnu.liblist not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .dynbss not found in /usr/lib/debug/usr/sbin/qpidd.debug
  warning: section .gnu.conflict not found in /usr/lib/debug/usr/sbin/qpidd.debug
  Core was generated by `/usr/sbin/qpidd --log-enable info+ --auth no --tpl-wcache-page-size 128 --tpl-j'.
  Program terminated with signal 11, Segmentation fault.
  #0  0x00002ba97da13440 in DbTxn::abort() () from /usr/lib64/libdb_cxx-4.3.so
  (gdb) rax            0x0	0
  rbx            0x1d2e9440	489591872
  rcx            0x1d2e97e0	489592800
  rdx            0x0	0
  rsi            0x1	1
  rdi            0x1d2e9440	489591872
  rbp            0x7fffa26912f0	0x7fffa26912f0
  rsp            0x7fffa2690df0	0x7fffa2690df0
  r8             0x0	0
  r9             0x2ba97d72caf9	48006954142457
  r10            0x2ba97d6db000	48006953807872
  r11            0x3c1e2bced0	258204225232
  r12            0x1d2e9860	489592928
  r13            0x2ba97d9c4a40	48006956862016
  r14            0x1d2e5028	489574440
  r15            0x1d2e5088	489574536
  rip            0x2ba97da13440	0x2ba97da13440 <DbTxn::abort()+32>
  eflags         0x10202	[ IF RF ]
  cs             0x33	51
  ss             0x2b	43
  ds             0x0	0
  es             0x0	0
  fs             0x0	0
  gs             0x0	0
  fctrl          0x37f	895
  fstat          0x0	0
  ftag           0xffff	65535
  fiseg          0x0	0
  fioff          0x0	0
  foseg          0x0	0
  fooff          0x0	0
  fop            0x0	0
  mxcsr          0x1f80	[ IM DM ZM OM UM PM ]
  (gdb) Using memory regions provided by the target.
  There are no memory regions defined.
  (gdb) 16   AT_HWCAP             Machine-dependent CPU capability hints 0xbfebfbff
  6    AT_PAGESZ            System page size               4096
  17   AT_CLKTCK            Frequency of times()           100
  3    AT_PHDR              Program headers for program    0x400040
  4    AT_PHENT             Size of program header entry   56
  5    AT_PHNUM             Number of program headers      8
  7    AT_BASE              Base address of interpreter    0x0
  8    AT_FLAGS             Flags                          0x0
  9    AT_ENTRY             Entry point of program         0x404ff0
  11   AT_UID               Real user ID                   0
  12   AT_EUID              Effective user ID              0
  13   AT_GID               Real group ID                  0
  14   AT_EGID              Effective group ID             0
  23   AT_SECURE            Boolean, was exec setuid-like? 0
  15   AT_PLATFORM          String identifying platform    0x7fffa2693fc9 "x86_64"
  0    AT_NULL              End of vector                  0x0
  (gdb) Stack level 0, frame at 0x7fffa2690e10:
   rip = 0x2ba97da13440 in DbTxn::abort(); saved rip 0x2ba97d74e01f
   called by frame at 0x7fffa2690e30
   Arglist at 0x7fffa2690de8, args: 
   Locals at 0x7fffa2690de8, Previous frame's sp is 0x7fffa2690e10
   Saved registers:
    rbx at 0x7fffa2690df0, rbp at 0x7fffa2690df8, r12 at 0x7fffa2690e00,
    rip at 0x7fffa2690e08
  (gdb) From                To                  Syms Read   Shared Object Library
  0x0000003c0dea14b0  0x0000003c0e022a78  Yes (*)     /usr/lib64/libqpidbroker.so.3
  0x0000003c0d50dd60  0x0000003c0d60f6c8  Yes (*)     /usr/lib64/libqpidcommon.so.3
  0x0000003c0cc10aa0  0x0000003c0cc2dae8  Yes (*)     /usr/lib64/libboost_program_options.so.2
  0x0000003c0da04810  0x0000003c0da0cff8  Yes (*)     /usr/lib64/libboost_filesystem.so.2
  0x0000003c20601500  0x0000003c20602918  Yes (*)     /lib64/libuuid.so.1
  0x0000003c0c400e10  0x0000003c0c401a08  Yes (*)     /lib64/libdl.so.2
  0x0000003c0d002220  0x0000003c0d005cc8  Yes (*)     /lib64/librt.so.1
  0x0000003c20a046e0  0x0000003c20a13be8  Yes (*)     /usr/lib64/libsasl2.so.2
  0x0000003c1e24f430  0x0000003c1e2c3058  Yes (*)     /usr/lib64/libstdc++.so.6
  0x0000003c0c003e60  0x0000003c0c043e38  Yes (*)     /lib64/libm.so.6
  0x0000003c19a01e50  0x0000003c19a0b018  Yes (*)     /lib64/libgcc_s.so.1
  0x0000003c0bc1d780  0x0000003c0bd09ff8  Yes (*)     /lib64/libc.so.6
  0x0000003c0b800a70  0x0000003c0b81671e  Yes (*)     /lib64/ld-linux-x86-64.so.2
  0x0000003c0c8051f0  0x0000003c0c810258  Yes (*)     /lib64/libpthread.so.0
  0x0000003c110032a0  0x0000003c1100e2d8  Yes (*)     /lib64/libresolv.so.2
  0x0000003c1e6009f0  0x0000003c1e606918  Yes (*)     /lib64/libcrypt.so.1
  0x00002ba97a98a5e0  0x00002ba97a98db68  Yes (*)     /usr/lib64/qpid/daemon/watchdog.so
  0x00002ba97ab987b0  0x00002ba97aba2568  Yes (*)     /usr/lib64/qpid/daemon/xml.so
  0x00002ba97af2e070  0x00002ba97b0af758  Yes (*)     /usr/lib64/libxerces-c.so.28
  0x00002ba97b544090  0x00002ba97b6cfb28  Yes (*)     /usr/lib64/libxqilla.so.3
  0x00002ba97ba15f10  0x00002ba97ba364e8  Yes (*)     /usr/lib64/qpid/daemon/acl.so
  0x00002ba97bc49570  0x00002ba97bc51728  Yes (*)     /usr/lib64/qpid/daemon/ssl.so
  0x00002ba97be77640  0x00002ba97be8e0a8  Yes (*)     /usr/lib64/libsslcommon.so.3
  0x0000003c1ea18080  0x0000003c1eaf4968  Yes (*)     /usr/lib64/libnss3.so
  0x0000003c1fe07ee0  0x0000003c1fe28ae8  Yes (*)     /usr/lib64/libssl3.so
  0x0000003c1fa0ced0  0x0000003c1fa2b018  Yes (*)     /usr/lib64/libnspr4.so
  0x0000003c1f2080d0  0x0000003c1f212428  Yes (*)     /usr/lib64/libnssutil3.so
  0x0000003c1ee01370  0x0000003c1ee02978  Yes (*)     /usr/lib64/libplc4.so
  0x0000003c1f600e10  0x0000003c1f601c08  Yes (*)     /usr/lib64/libplds4.so
  0x00002ba97c09db00  0x00002ba97c0a55a8  Yes (*)     /usr/lib64/qpid/daemon/rdma.so
  0x00002ba97c2c9270  0x00002ba97c2d6928  Yes (*)     /usr/lib64/librdmawrap.so.3
  0x00002ba97c4df360  0x00002ba97c4e5e38  Yes (*)     /usr/lib64/libibverbs.so.1
  0x00002ba97c6ea580  0x00002ba97c6ecf88  Yes (*)     /usr/lib64/librdmacm.so.1
  0x00002ba97c8f4b70  0x00002ba97c8fa648  Yes (*)     /usr/lib64/qpid/daemon/replicating_listener.so
  0x00002ba97cb3f330  0x00002ba97cba8ce8  Yes (*)     /usr/lib64/qpid/daemon/cluster.so
  0x00002ba97cde43d0  0x00002ba97cde6338  Yes (*)     /usr/lib64/openais/libcpg.so.2
  0x00002ba97cfe8110  0x00002ba97cfeab78  Yes (*)     /usr/lib64/libcman.so.2
  0x00002ba97d22b630  0x00002ba97d2a41d8  Yes (*)     /usr/lib64/libqpidclient.so.3
  0x00002ba97d4d3610  0x00002ba97d4d7bd8  Yes (*)     /usr/lib64/qpid/daemon/replication_exchange.so
  0x00002ba97d712e00  0x00002ba97d7a05c8  Yes (*)     /usr/lib64/qpid/daemon/msgstore.so
  0x00002ba97da095d0  0x00002ba97dac2288  Yes (*)     /usr/lib64/libdb_cxx-4.3.so
  0x00002ba97dced510  0x00002ba97dced6d1  Yes (*)     /usr/lib64/libaio.so.1
  (*): Shared library is missing debugging information.
  (gdb)   2 Thread 25614  0x0000003c0c80b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
     from /lib64/libpthread.so.0
  * 1 Thread 25612  0x00002ba97da13440 in DbTxn::abort() ()
     from /usr/lib64/libdb_cxx-4.3.so
  Thread 2 (Thread 25614):
  #0  0x0000003c0c80b150 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
     from /lib64/libpthread.so.0
  #1  0x0000003c0d5fafdf in qpid::sys::Timer::run (this=0x1d2cfa20)
      at ../include/qpid/sys/posix/Condition.h:69
  #2  0x0000003c0d52286a in qpid::sys::(anonymous namespace)::runRunnable (
      p=0x1d2cfa54) at qpid/sys/posix/Thread.cpp:35
  #3  0x0000003c0c80673d in start_thread () from /lib64/libpthread.so.0
  #4  0x0000003c0bcd3d1d in clone () from /lib64/libc.so.6
  Thread 1 (Thread 25612):
  #0  0x00002ba97da13440 in DbTxn::abort() () from /usr/lib64/libdb_cxx-4.3.so
  #1  0x00002ba97d74e01f in mrg::msgstore::TxnCtxt::abort (this=0x7fffa26912f0)
      at TxnCtxt.cpp:131
  #2  0x00002ba97d74ebb5 in mrg::msgstore::TxnCtxt::~TxnCtxt (this=0x1d2e9440, 
      __in_chrg=<value optimized out>) at TxnCtxt.cpp:77
  #3  0x00002ba97d72cc54 in mrg::msgstore::MessageStoreImpl::init (
      this=0x1d2e9860) at MessageStoreImpl.cpp:359
  #4  0x00002ba97d732459 in mrg::msgstore::MessageStoreImpl::init (
      this=0x1d2e5000, dir=..., jfiles=16, jfileSizePgs=32, truncateFlag=false, 
      wCachePageSizeKib=32, tplJfiles=8, tplJfileSizePgs=32, 
      tplWCachePageSizeKib=128, autoJExpand=false, autoJExpandMaxFiles=0)
      at MessageStoreImpl.cpp:289
  #5  0x00002ba97d733098 in mrg::msgstore::MessageStoreImpl::init (
      this=0x1d2e5000, options=<value optimized out>)
      at MessageStoreImpl.cpp:256
  #6  0x00002ba97d713485 in qpid::broker::StorePlugin::earlyInitialize (
      this=0x2ba97d9c8ce0, target=...) at StorePlugin.cpp:56
  #7  0x0000003c0d5aff52 in for_each<__gnu_cxx::__normal_iterator<qpid::Plugin* const*, std::vector<qpid::Plugin*, std::allocator<qpid::Plugin*> > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > > (t=<value optimized out>)
      at /usr/include/boost/bind/mem_fn_template.hpp:149
  #8  each_plugin<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > > (t=<value optimized out>) at qpid/Plugin.cpp:73
  #9  qpid::Plugin::earlyInitAll (t=<value optimized out>) at qpid/Plugin.cpp:87
  #10 0x0000003c0df14b89 in qpid::broker::Broker::Broker (this=0x1d2cf9d0, 
      conf=...) at qpid/broker/Broker.cpp:214
  #11 0x0000000000406a50 in QpiddBroker::execute (this=<value optimized out>, 
      options=0x1d2c8180) at posix/QpiddBroker.cpp:171
  #12 0x00000000004055af in main (argc=17, argv=0x7fffa2693be8) at qpidd.cpp:80
  (gdb) quit

Comment 1 Kim van der Riet 2010-07-15 18:07:24 UTC
Can you look closely at the broker logs in these cases for any unusual entries concerning store initialization?

Comment 2 Gordon Sim 2010-07-16 11:00:43 UTC
Removed blocker flag as this is only an issue on startup (i.e. isn't a crash of an operational broker) and is believed not to be a regression. Some more stats/data on reproducibility would help assess priority though...

Comment 3 Frantisek Reznicek 2010-07-16 11:57:15 UTC
Further information requested:
1] There were seen two occurrences of the crash on RHEL 5.5 x86_64 not i386
2] Unfortunately I do not have the core file timestamps, so I cannot exactly find which subrun of the tests caused the crash, there are following two candidates (qpidd logs follows info+) sorted:
[08:48:42] run 8/12--------------------------[err_cnt:1]-
qpidd-transcript-------------
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/rdma.so
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2010-07-12 08:48:42 info No message store configured, persistence is disabled.
2010-07-12 08:48:42 info Management enabled
2010-07-12 08:48:42 info ManagementAgent generated broker ID: c0e76e0c-7d78-4876-9658-81b8d32dc025
2010-07-12 08:48:42 info Registered xml exchange
2010-07-12 08:48:42 info Registered replication exchange
2010-07-12 08:48:42 error BDB exception occurred while initializing store: Db::open: No such file or directory

or less likely:

2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so
2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/rdma.so
2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2010-07-12 08:49:23 info No message store configured, persistence is disabled.
2010-07-12 08:49:23 info Management enabled
2010-07-12 08:49:23 info ManagementAgent generated broker ID: 71b036f8-5cf6-4df2-afa0-cdd9f498ad27
2010-07-12 08:49:23 info Registered xml exchange
2010-07-12 08:49:23 info Registered replication exchange
2010-07-12 08:49:32 notice Journal "TplStore": Created
2010-07-12 08:49:32 notice Store module initialized; store-dir=/mnt/tests/distribution/MRG/Messaging/qpid_broker_jfail_bz456272/rhts_qpidd/20100712_082559/broker.1451
2010-07-12 08:49:32 info > Default files per journal: 16
2010-07-12 08:49:32 info > Default journal file size: 32 (wpgs)
2010-07-12 08:49:32 info > Default write cache page size: 32 (Kib)
2010-07-12 08:49:32 info > Default number of write cache pages: 32
2010-07-12 08:49:32 info > TPL files per journal: 8
2010-07-12 08:49:32 info > TPL journal file size: 32 (wpgs)
2010-07-12 08:49:32 info > TPL write cache page size: 128 (Kib)
2010-07-12 08:49:32 info > TPL number of write cache pages: 8
2010-07-12 08:49:32 info Most recent persistence id found: 0x0
2010-07-12 08:49:32 notice SASL disabled: No Authentication Performed
2010-07-12 08:49:32 critical Unexpected error: Can't bind to port :5672: Address already in use (qpid/sys/posix/Socket.cpp:201)

I'm going to dump the corefile file stamp to avoid uncertainty.

Comment 4 Kim van der Riet 2010-07-16 12:08:32 UTC
The first log contains the info I am looking for:

BDB exception occurred while initializing store: Db::open: No such file or directory

This shows that it was the open of a BDB database file that failed. It might be possible that by keeping the disk extremely busy, that BDB is unable to create the necessary database files in a timely manner, and fail instead.

The exceptions that BDB throws are not being well handled in the store code here; attempting to abort() in the exception handler when the transaction has not even begun is obviously incorrect. However, this is a relatively easy fix. I will attempt some to reproduce the error; however, I suggest that:

1) This is not a regression, but a long-standing flaw;
2) Occurs rarely, and only on startup/initialization of the broker;
3) It might be possible to make more than one attempt to initialize BDB after pausing for a second or so between attempts, then failing after several tries.

Comment 5 Kim van der Riet 2010-07-16 12:17:08 UTC
I have confirmed that both errors in comment #1 above were accompanied by the same error message in the broker log:

BDB exception occurred while initializing store: Db::open: No such file or
directory

Comment 6 Kim van der Riet 2010-07-21 11:39:27 UTC
I have been unable to reproduce this error on mrg10 (in spite of using 3 running copies of watson and a running copy of tiobench). I attempted both running the test script in comment #1 and a repetitive starting/stoping the broker (since that is when the error occurs). This bug may show up on machines with certain disk controllers and/or other hardware characteristics.

However, I commited a fix in two parts:

r.4133: A fix for the exception handling. This fixes the poor handling of BDB errors when they occur (but does not change the odds of this particular exception from happening).

r.4134: I added a retry loop into the store initialization code which will allow the store to attempt to start BDB up to 3 times. If a BDB exception is thrown during initialization all three times, then the last exception will be rethrown, and the broker startup will be aborted.

I have tested both fixes by artificially throwing BDB exceptions during startup. However, the fix needs to be tested in an actual fail scenario. Owing to the inability to reproduce this on mrg10, I recommend that a trunk build of r.965840/r.4133/4 be made on the same machine on which these errors were observed. As this error is a store initialzation problem, it may be possible to speed up the occurrence of this error by replacing the test script with one that simply starts and stops the broker repetitively while the disk is kept busy by other activities.

I recommend that only r.4133 be put into MRG 1.3; r.4134 should make it into the next release.

Comment 7 Frantisek Reznicek 2010-08-18 09:26:33 UTC
The extensive testing showed another case of above mentioned issue on RHEL 5.5 x86_64 (qpid-cpp-server-0.7.946106-12.el4) resulting in broker's segmentation fault:

[16:33:09] Core file: ./core.20446 generated by /usr/sbin/qpidd ----------------------1/1-
-rw-------  1 root root 13889536 Aug 13 16:01 ./core.20446
./core.20446: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'qpidd'
  GNU gdb Red Hat Linux (6.3.0.0-1.162.el4rh)
  Copyright 2004 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB.  Type "show warranty" for details.
  This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
  Core was generated by `/usr/sbin/qpidd --log-enable info+ --auth no --tpl-wcache-page-size 128 --tpl-j'.
  Program terminated with signal 11, Segmentation fault.
  #0  0x0000002a95a23e8a in DbTxn::abort () from /usr/lib64/tls/libdb_cxx-4.2.so
  (gdb) rax            0x0	0
  rbx            0x570500	5702912
  rcx            0x570980	5704064
  rdx            0x2a95a81ef5	182899449589
  rsi            0x2a95ac9620	182899742240
  rdi            0x570500	5702912
  rbp            0xffff86fe	0xffff86fe
  rsp            0x7fbfffd2c0	0x7fbfffd2c0
  r8             0x1d680	120448
  r9             0x39ca415000	248206413824
  r10            0x1	1
  r11            0x39ce8af200	248278348288
  r12            0x570410	5702672
  r13            0x7fbfffdab0	548682062512
  r14            0x7fbfffda90	548682062480
  r15            0x56c2a8	5685928
  rip            0x2a95a23e8a	0x2a95a23e8a <DbTxn::abort()+58>
  eflags         0x10206	66054
  cs             0x33	51
  ss             0x2b	43
  ds             0x0	0
  es             0x0	0
  fs             0x0	0
  gs             0x0	0
  (gdb) There are no memory regions defined.
  (gdb) 16   AT_HWCAP             Machine-dependent CPU capability hints 0x178bfbff
  6    AT_PAGESZ            System page size               4096
  17   AT_CLKTCK            Frequency of times()           100
  3    AT_PHDR              Program headers for program    0x400040
  4    AT_PHENT             Size of program header entry   56
  5    AT_PHNUM             Number of program headers      8
  7    AT_BASE              Base address of interpreter    0x0
  8    AT_FLAGS             Flags                          0x0
  9    AT_ENTRY             Entry point of program         0x4054e0
  11   AT_UID               Real user ID                   0
  12   AT_EUID              Effective user ID              0
  13   AT_GID               Real group ID                  0
  14   AT_EGID              Effective group ID             0
  23   AT_SECURE            Boolean, was exec setuid-like? 0
  15   AT_PLATFORM          String identifying platform    0x7fbfffeff7 "x86_64"
  0    AT_NULL              End of vector                  0x0
  (gdb) Stack level 0, frame at 0x7fbfffd2e0:
   rip = 0x2a95a23e8a in DbTxn::abort(); saved rip 0x2a958510a5
   called by frame at 0x7fbfffd4a0
   Arglist at 0x7fbfffd2b8, args: 
   Locals at 0x7fbfffd2b8, Previous frame's sp is 0x7fbfffd2e0
   Saved registers:
    rbx at 0x7fbfffd2c0, rbp at 0x7fbfffd2c8, r12 at 0x7fbfffd2d0,
    rip at 0x7fbfffd2d8
  (gdb) From                To                  Syms Read   Shared Object Library
  0x00000039cbbc7500  0x00000039cbd9b2d8  Yes         /usr/lib64/libqpidbroker.so.3
  0x00000039cac47ec0  0x00000039cad59558  Yes         /usr/lib64/libqpidcommon.so.3
  0x00000039cb707990  0x00000039cb7158d8  Yes         /usr/lib64/libqpidtypes.so.1
  0x00000039cb914160  0x00000039cb935e68  Yes         /usr/lib64/libboost_program_options.so.1
  0x00000039ca905cc0  0x00000039ca90d7a8  Yes         /usr/lib64/libboost_filesystem.so.1
  0x00000039ca701670  0x00000039ca7028c8  Yes         /lib64/tls/libuuid.so.1
  0x00000039d0202c90  0x00000039d0207c08  Yes         /lib64/tls/librt.so.1
  0x00000039d22044f0  0x00000039d2211ec8  Yes         /usr/lib64/libsasl2.so.2
  0x00000039ca500f80  0x00000039ca501918  Yes         /lib64/libdl.so.2
  0x00000039ccf036a0  0x00000039ccf0ee28  Yes         /lib64/libresolv.so.2
  0x00000039d1e00bf0  0x00000039d1e05668  Yes         /lib64/libcrypt.so.1
  0x00000039ce84d090  0x00000039ce8b5498  Yes         /usr/lib64/libstdc++.so.6
  0x00000039cb303df0  0x00000039cb346b98  Yes         /lib64/tls/libm.so.6
  0x00000039ce602060  0x00000039ce60ab08  Yes         /lib64/libgcc_s.so.1
  0x00000039cb01c2d0  0x00000039cb0fe08c  Yes         /lib64/tls/libc.so.6
  0x00000039cb505100  0x00000039cb50d228  Yes         /lib64/tls/libpthread.so.0
  0x00000039ca300a80  0x00000039ca310de7  Yes         /lib64/ld-linux-x86-64.so.2
  0x0000002a955902a0  0x0000002a955afa58  Yes         /usr/lib64/qpid/daemon/acl.so
  0x0000002a956c0490  0x0000002a956c6428  Yes         /usr/lib64/qpid/daemon/replication_exchange.so
  0x0000002a95803630  0x0000002a958aa028  Yes         /usr/lib64/qpid/daemon/msgstore.so
  0x0000002a95a1bea0  0x0000002a95ac0f68  Yes         /usr/lib64/tls/libdb_cxx-4.2.so
  0x0000002a95be85b0  0x0000002a95be8776  Yes         /usr/lib64/libaio.so.1
  0x0000002a95cf3c70  0x0000002a95cfeab8  Yes         /usr/lib64/qpid/daemon/ssl.so
  0x0000002a95e327c0  0x0000002a95e4ac18  Yes         /usr/lib64/libsslcommon.so.3
  0x00000039d3918d50  0x00000039d39fed38  Yes         /usr/lib64/libnss3.so
  0x00000039d3708ca0  0x00000039d372c018  Yes         /usr/lib64/libssl3.so
  0x00000039d240d4f0  0x00000039d242bec8  Yes         /usr/lib64/libnspr4.so
  0x00000039d35085f0  0x00000039d3512c08  Yes         /usr/lib64/libnssutil3.so
  0x00000039d26015d0  0x00000039d2602cb8  Yes         /usr/lib64/libplc4.so
  0x00000039d2801010  0x00000039d2801e68  Yes         /usr/lib64/libplds4.so
  0x0000002a95f53f20  0x0000002a95f5d278  Yes         /usr/lib64/libz.so.1
  0x0000002a9606c990  0x0000002a96076928  Yes         /usr/lib64/qpid/daemon/replicating_listener.so
  0x0000002a96183f60  0x0000002a9618e228  Yes         /usr/lib64/qpid/daemon/xml.so
  0x0000002a96424330  0x0000002a965b2098  Yes         /usr/lib64/libxerces-c.so.28
  0x0000002a96940cf0  0x0000002a96ad23c8  Yes         /usr/lib64/libxqilla.so.3
  (gdb)   2 process 20448  0x00000039cb508f0f in pthread_cond_timedwait@@GLIBC_2.3.2
      () from /lib64/tls/libpthread.so.0
  * 1 process 20446  0x0000002a95a23e8a in DbTxn::abort ()
     from /usr/lib64/tls/libdb_cxx-4.2.so
  Thread 2 (process 20448):
  #0  0x00000039cb508f0f in pthread_cond_timedwait@@GLIBC_2.3.2 ()
     from /lib64/tls/libpthread.so.0
  #1  0x00000039cad519c3 in qpid::sys::Timer::run (this=0x568690)
      at ../include/qpid/sys/posix/Condition.h:69
  #2  0x00000039cac6008a in qpid::sys::(anonymous namespace)::runRunnable (
      p=0x5686c4) at qpid/sys/posix/Thread.cpp:35
  #3  0x00000039cb506317 in start_thread () from /lib64/tls/libpthread.so.0
  #4  0x00000039cb0c9d83 in clone () from /lib64/tls/libc.so.6
  Thread 1 (process 20446):
  #0  0x0000002a95a23e8a in DbTxn::abort () from /usr/lib64/tls/libdb_cxx-4.2.so
  #1  0x0000002a958510a5 in mrg::msgstore::TxnCtxt::abort (this=0x7fbfffda20)
      at TxnCtxt.cpp:131
  #2  0x0000002a95851d72 in ~TxnCtxt (this=0x570500) at TxnCtxt.cpp:77
  #3  0x0000002a9582fe88 in mrg::msgstore::MessageStoreImpl::init (this=Variable "this" is not available.
      at MessageStoreImpl.cpp:346
  #4  0x0000002a95831558 in mrg::msgstore::MessageStoreImpl::init (
      this=0x56c290, dir=@0x2a959d1f90, jfiles=16, jfileSizePgs=32, 
      truncateFlag=false, wCachePageSizeKib=32, tplJfiles=8, 
      tplJfileSizePgs=32, tplWCachePageSizeKib=128, autoJExpand=false, 
      autoJExpandMaxFiles=0) at MessageStoreImpl.cpp:289
  #5  0x0000002a9583c952 in mrg::msgstore::MessageStoreImpl::init (
      this=0x56c290, options=Variable "options" is not available.
  ) at MessageStoreImpl.cpp:256
  #6  0x0000002a95804040 in qpid::broker::StorePlugin::earlyInitialize (
      this=0x2a959d1ee0, target=@0x568648) at StorePlugin.cpp:56
  #7  0x00000039cacf8e55 in std::for_each<__gnu_cxx::__normal_iterator<qpid::Plugin* const*, std::vector<qpid::Plugin*, std::allocator<qpid::Plugin*> > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > > (__first={_M_current = 0x557f78}, __last={_M_current = 0x557f98}, 
            {f_ = {f_ = {__pfn = 0x19, __delta = 0}}, l_ = {a1_ = {<No data fields>}, a2_ = {t_ = 0x568648}}}) at /usr/include/boost/bind/mem_fn_template.hpp:149
  #8  0x00000039cacf8f29 in qpid::(anonymous namespace)::each_plugin<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > > (f=@0x7fbfffdf80)
      at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_vector.h:322
  #9  0x00000039cacf8134 in qpid::Plugin::earlyInitAll (t=Variable "t" is not available.
      at /usr/include/boost/ref.hpp:64
  #10 0x00000039cbc598be in Broker (this=0x568640, conf=@0x54d3f8)
      at qpid/broker/Broker.cpp:216
  #11 0x000000000040e3a1 in QpiddBroker::execute (this=Variable "this" is not available.
      at /usr/include/boost/intrusive_ptr.hpp:62
  #12 0x0000000000408d34 in main (argc=17, argv=0x7fbfffeb98)
      at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/memory:301
  #13 0x00000039cb01c4cb in __libc_start_main () from /lib64/tls/libc.so.6
  #14 0x000000000040550a in _start ()
  #15 0x0000007fbfffeb88 in ?? ()
  #16 0x000000000000001c in ?? ()
  #17 0x0000000000000011 in ?? ()
  #18 0x0000007fbfffeffe in ?? ()
  #19 0x0000007fbffff00e in ?? ()
  #20 0x0000007fbffff01b in ?? ()
  #21 0x0000007fbffff021 in ?? ()
  #22 0x0000007fbffff028 in ?? ()
  #23 0x0000007fbffff02b in ?? ()
  #24 0x0000007fbffff042 in ?? ()
  #25 0x0000007fbffff046 in ?? ()
  #26 0x0000007fbffff05b in ?? ()
  #27 0x0000007fbffff05e in ?? ()
  #28 0x0000007fbffff06b in ?? ()
  #29 0x0000007fbffff06e in ?? ()
  #30 0x0000007fbffff07f in ?? ()
  #31 0x0000007fbffff082 in ?? ()
  #32 0x0000007fbffff08d in ?? ()
  #33 0x0000007fbffff0f5 in ?? ()
  #34 0x0000007fbffff0fc in ?? ()
  #35 0x0000000000000000 in ?? ()

Looking at the error messages of the broker I found no occurrence of above discussed message: 'error BDB exception occurred while initializing store:'
The recorded log looks like this:

2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so
2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so
2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2010-08-13 15:58:58 info No message store configured, persistence is disabled.
2010-08-13 15:58:58 info Management enabled
2010-08-13 15:58:58 info ManagementAgent generated broker ID: 00d2b040-1ed2-45b0-a9c7-5a2b7e29c02b
2010-08-13 15:58:58 info Registered replication exchange



I believe this proves that fix is not efficient enough.

-> ASSIGNED

Comment 9 Kim van der Riet 2010-08-18 12:55:28 UTC
Note from comment 6 above that the fix came in two parts. Only the exception handling part was included into 1.3, and thus the odds of this error occurring are unchanged.

However, the way in which the exception in comment 7 unfolded was unexpected. The basic failure was in BDB; either the BDB environment or the transaction begin failed. However, this seems not to be handed in the way I was expecting. I will investigate this aspect further.

I recommend pushing this bug out of 1.3 at this point. In any event, the fix for the cause of the bug is not in 1.3 either - see comment 6 above.

Comment 10 Kim van der Riet 2010-08-18 15:42:18 UTC
Additional error handling included in r.4212. I found one case of an improperly handled exception while initializing BDB, this is the only way that the trace seen above can be explained (that I can see).

Comment 11 Kim van der Riet 2010-08-19 13:49:29 UTC
Changed target milestone to 1.4.

The fix for the cause of this is on trunk, but has not been included into 1.3 (although the improved exception handling has, because of its relatively low risk). Once MRG rebases, this fix will get into the build.

Comment 14 Petr Matousek 2011-05-07 16:35:22 UTC
This issue has been fixed

Verified on RHEL5.6 and RHEL6.1, architectures: i386, x86_64

bz456272.py executed in loop 1800 times
Test duration: over 100 hours
Result: no broker crash occurred

packages installed:
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-6.el6.x86_64
qpid-cpp-client-0.10-3.el6.x86_64
qpid-cpp-client-devel-0.10-3.el6.x86_64
qpid-cpp-client-devel-docs-0.10-3.el6.noarch
qpid-cpp-client-rdma-0.10-3.el6.x86_64
qpid-cpp-client-ssl-0.10-3.el6.x86_64
qpid-cpp-server-0.10-3.el6.x86_64
qpid-cpp-server-cluster-0.10-3.el6.x86_64
qpid-cpp-server-devel-0.10-3.el6.x86_64
qpid-cpp-server-rdma-0.10-3.el6.x86_64
qpid-cpp-server-ssl-0.10-3.el6.x86_64
qpid-cpp-server-store-0.10-3.el6.x86_64
qpid-cpp-server-xml-0.10-3.el6.x86_64
qpid-java-client-0.10-4.el6.noarch
qpid-java-common-0.10-4.el6.noarch
qpid-java-example-0.10-4.el6.noarch
qpid-java-jca-0.10-4.el6.noarch
qpid-qmf-0.10-6.el6.x86_64
qpid-qmf-devel-0.10-6.el6.x86_64
qpid-tests-0.10-1.el6.noarch
qpid-tools-0.10-3.el6.noarch
rh-qpid-cpp-tests-0.10-3.el6.x86_64
rh-tests-distribution-MRG-Messaging-qpid_broker_jfail_bz456272-1.5-8.noarch
rh-tests-distribution-MRG-Messaging-qpid_common-1.6-56.noarch
ruby-qpid-0.7.946106-2.el6.x86_64
ruby-qpid-qmf-0.10-6.el6.x86_64

-> VERIFIED

Comment 15 errata-xmlrpc 2011-06-23 15:45:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html