Bug 614944
Summary: | qpidd broker crash in mrg::msgstore::TxnCtxt::abort() -> DbTxn::abort() | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Frantisek Reznicek <freznice> |
Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> |
Status: | CLOSED ERRATA | QA Contact: | Petr Matousek <pematous> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | Development | CC: | esammons, gsim, jross, kim.vdriet, pematous |
Target Milestone: | 2.0 | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | qpid-cpp-mrg-0.9.1079953 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-06-23 15:45:35 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Frantisek Reznicek
2010-07-15 15:37:51 UTC
Can you look closely at the broker logs in these cases for any unusual entries concerning store initialization? Removed blocker flag as this is only an issue on startup (i.e. isn't a crash of an operational broker) and is believed not to be a regression. Some more stats/data on reproducibility would help assess priority though... Further information requested: 1] There were seen two occurrences of the crash on RHEL 5.5 x86_64 not i386 2] Unfortunately I do not have the core file timestamps, so I cannot exactly find which subrun of the tests caused the crash, there are following two candidates (qpidd logs follows info+) sorted: [08:48:42] run 8/12--------------------------[err_cnt:1]- qpidd-transcript------------- 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/xml.so 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/acl.so 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/rdma.so 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so 2010-07-12 08:48:42 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so 2010-07-12 08:48:42 info No message store configured, persistence is disabled. 2010-07-12 08:48:42 info Management enabled 2010-07-12 08:48:42 info ManagementAgent generated broker ID: c0e76e0c-7d78-4876-9658-81b8d32dc025 2010-07-12 08:48:42 info Registered xml exchange 2010-07-12 08:48:42 info Registered replication exchange 2010-07-12 08:48:42 error BDB exception occurred while initializing store: Db::open: No such file or directory or less likely: 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/watchdog.so 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/xml.so 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/acl.so 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/rdma.so 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so 2010-07-12 08:49:23 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so 2010-07-12 08:49:23 info No message store configured, persistence is disabled. 2010-07-12 08:49:23 info Management enabled 2010-07-12 08:49:23 info ManagementAgent generated broker ID: 71b036f8-5cf6-4df2-afa0-cdd9f498ad27 2010-07-12 08:49:23 info Registered xml exchange 2010-07-12 08:49:23 info Registered replication exchange 2010-07-12 08:49:32 notice Journal "TplStore": Created 2010-07-12 08:49:32 notice Store module initialized; store-dir=/mnt/tests/distribution/MRG/Messaging/qpid_broker_jfail_bz456272/rhts_qpidd/20100712_082559/broker.1451 2010-07-12 08:49:32 info > Default files per journal: 16 2010-07-12 08:49:32 info > Default journal file size: 32 (wpgs) 2010-07-12 08:49:32 info > Default write cache page size: 32 (Kib) 2010-07-12 08:49:32 info > Default number of write cache pages: 32 2010-07-12 08:49:32 info > TPL files per journal: 8 2010-07-12 08:49:32 info > TPL journal file size: 32 (wpgs) 2010-07-12 08:49:32 info > TPL write cache page size: 128 (Kib) 2010-07-12 08:49:32 info > TPL number of write cache pages: 8 2010-07-12 08:49:32 info Most recent persistence id found: 0x0 2010-07-12 08:49:32 notice SASL disabled: No Authentication Performed 2010-07-12 08:49:32 critical Unexpected error: Can't bind to port :5672: Address already in use (qpid/sys/posix/Socket.cpp:201) I'm going to dump the corefile file stamp to avoid uncertainty. The first log contains the info I am looking for: BDB exception occurred while initializing store: Db::open: No such file or directory This shows that it was the open of a BDB database file that failed. It might be possible that by keeping the disk extremely busy, that BDB is unable to create the necessary database files in a timely manner, and fail instead. The exceptions that BDB throws are not being well handled in the store code here; attempting to abort() in the exception handler when the transaction has not even begun is obviously incorrect. However, this is a relatively easy fix. I will attempt some to reproduce the error; however, I suggest that: 1) This is not a regression, but a long-standing flaw; 2) Occurs rarely, and only on startup/initialization of the broker; 3) It might be possible to make more than one attempt to initialize BDB after pausing for a second or so between attempts, then failing after several tries. I have confirmed that both errors in comment #1 above were accompanied by the same error message in the broker log: BDB exception occurred while initializing store: Db::open: No such file or directory I have been unable to reproduce this error on mrg10 (in spite of using 3 running copies of watson and a running copy of tiobench). I attempted both running the test script in comment #1 and a repetitive starting/stoping the broker (since that is when the error occurs). This bug may show up on machines with certain disk controllers and/or other hardware characteristics. However, I commited a fix in two parts: r.4133: A fix for the exception handling. This fixes the poor handling of BDB errors when they occur (but does not change the odds of this particular exception from happening). r.4134: I added a retry loop into the store initialization code which will allow the store to attempt to start BDB up to 3 times. If a BDB exception is thrown during initialization all three times, then the last exception will be rethrown, and the broker startup will be aborted. I have tested both fixes by artificially throwing BDB exceptions during startup. However, the fix needs to be tested in an actual fail scenario. Owing to the inability to reproduce this on mrg10, I recommend that a trunk build of r.965840/r.4133/4 be made on the same machine on which these errors were observed. As this error is a store initialzation problem, it may be possible to speed up the occurrence of this error by replacing the test script with one that simply starts and stops the broker repetitively while the disk is kept busy by other activities. I recommend that only r.4133 be put into MRG 1.3; r.4134 should make it into the next release. The extensive testing showed another case of above mentioned issue on RHEL 5.5 x86_64 (qpid-cpp-server-0.7.946106-12.el4) resulting in broker's segmentation fault: [16:33:09] Core file: ./core.20446 generated by /usr/sbin/qpidd ----------------------1/1- -rw------- 1 root root 13889536 Aug 13 16:01 ./core.20446 ./core.20446: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'qpidd' GNU gdb Red Hat Linux (6.3.0.0-1.162.el4rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1". Core was generated by `/usr/sbin/qpidd --log-enable info+ --auth no --tpl-wcache-page-size 128 --tpl-j'. Program terminated with signal 11, Segmentation fault. #0 0x0000002a95a23e8a in DbTxn::abort () from /usr/lib64/tls/libdb_cxx-4.2.so (gdb) rax 0x0 0 rbx 0x570500 5702912 rcx 0x570980 5704064 rdx 0x2a95a81ef5 182899449589 rsi 0x2a95ac9620 182899742240 rdi 0x570500 5702912 rbp 0xffff86fe 0xffff86fe rsp 0x7fbfffd2c0 0x7fbfffd2c0 r8 0x1d680 120448 r9 0x39ca415000 248206413824 r10 0x1 1 r11 0x39ce8af200 248278348288 r12 0x570410 5702672 r13 0x7fbfffdab0 548682062512 r14 0x7fbfffda90 548682062480 r15 0x56c2a8 5685928 rip 0x2a95a23e8a 0x2a95a23e8a <DbTxn::abort()+58> eflags 0x10206 66054 cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 (gdb) There are no memory regions defined. (gdb) 16 AT_HWCAP Machine-dependent CPU capability hints 0x178bfbff 6 AT_PAGESZ System page size 4096 17 AT_CLKTCK Frequency of times() 100 3 AT_PHDR Program headers for program 0x400040 4 AT_PHENT Size of program header entry 56 5 AT_PHNUM Number of program headers 8 7 AT_BASE Base address of interpreter 0x0 8 AT_FLAGS Flags 0x0 9 AT_ENTRY Entry point of program 0x4054e0 11 AT_UID Real user ID 0 12 AT_EUID Effective user ID 0 13 AT_GID Real group ID 0 14 AT_EGID Effective group ID 0 23 AT_SECURE Boolean, was exec setuid-like? 0 15 AT_PLATFORM String identifying platform 0x7fbfffeff7 "x86_64" 0 AT_NULL End of vector 0x0 (gdb) Stack level 0, frame at 0x7fbfffd2e0: rip = 0x2a95a23e8a in DbTxn::abort(); saved rip 0x2a958510a5 called by frame at 0x7fbfffd4a0 Arglist at 0x7fbfffd2b8, args: Locals at 0x7fbfffd2b8, Previous frame's sp is 0x7fbfffd2e0 Saved registers: rbx at 0x7fbfffd2c0, rbp at 0x7fbfffd2c8, r12 at 0x7fbfffd2d0, rip at 0x7fbfffd2d8 (gdb) From To Syms Read Shared Object Library 0x00000039cbbc7500 0x00000039cbd9b2d8 Yes /usr/lib64/libqpidbroker.so.3 0x00000039cac47ec0 0x00000039cad59558 Yes /usr/lib64/libqpidcommon.so.3 0x00000039cb707990 0x00000039cb7158d8 Yes /usr/lib64/libqpidtypes.so.1 0x00000039cb914160 0x00000039cb935e68 Yes /usr/lib64/libboost_program_options.so.1 0x00000039ca905cc0 0x00000039ca90d7a8 Yes /usr/lib64/libboost_filesystem.so.1 0x00000039ca701670 0x00000039ca7028c8 Yes /lib64/tls/libuuid.so.1 0x00000039d0202c90 0x00000039d0207c08 Yes /lib64/tls/librt.so.1 0x00000039d22044f0 0x00000039d2211ec8 Yes /usr/lib64/libsasl2.so.2 0x00000039ca500f80 0x00000039ca501918 Yes /lib64/libdl.so.2 0x00000039ccf036a0 0x00000039ccf0ee28 Yes /lib64/libresolv.so.2 0x00000039d1e00bf0 0x00000039d1e05668 Yes /lib64/libcrypt.so.1 0x00000039ce84d090 0x00000039ce8b5498 Yes /usr/lib64/libstdc++.so.6 0x00000039cb303df0 0x00000039cb346b98 Yes /lib64/tls/libm.so.6 0x00000039ce602060 0x00000039ce60ab08 Yes /lib64/libgcc_s.so.1 0x00000039cb01c2d0 0x00000039cb0fe08c Yes /lib64/tls/libc.so.6 0x00000039cb505100 0x00000039cb50d228 Yes /lib64/tls/libpthread.so.0 0x00000039ca300a80 0x00000039ca310de7 Yes /lib64/ld-linux-x86-64.so.2 0x0000002a955902a0 0x0000002a955afa58 Yes /usr/lib64/qpid/daemon/acl.so 0x0000002a956c0490 0x0000002a956c6428 Yes /usr/lib64/qpid/daemon/replication_exchange.so 0x0000002a95803630 0x0000002a958aa028 Yes /usr/lib64/qpid/daemon/msgstore.so 0x0000002a95a1bea0 0x0000002a95ac0f68 Yes /usr/lib64/tls/libdb_cxx-4.2.so 0x0000002a95be85b0 0x0000002a95be8776 Yes /usr/lib64/libaio.so.1 0x0000002a95cf3c70 0x0000002a95cfeab8 Yes /usr/lib64/qpid/daemon/ssl.so 0x0000002a95e327c0 0x0000002a95e4ac18 Yes /usr/lib64/libsslcommon.so.3 0x00000039d3918d50 0x00000039d39fed38 Yes /usr/lib64/libnss3.so 0x00000039d3708ca0 0x00000039d372c018 Yes /usr/lib64/libssl3.so 0x00000039d240d4f0 0x00000039d242bec8 Yes /usr/lib64/libnspr4.so 0x00000039d35085f0 0x00000039d3512c08 Yes /usr/lib64/libnssutil3.so 0x00000039d26015d0 0x00000039d2602cb8 Yes /usr/lib64/libplc4.so 0x00000039d2801010 0x00000039d2801e68 Yes /usr/lib64/libplds4.so 0x0000002a95f53f20 0x0000002a95f5d278 Yes /usr/lib64/libz.so.1 0x0000002a9606c990 0x0000002a96076928 Yes /usr/lib64/qpid/daemon/replicating_listener.so 0x0000002a96183f60 0x0000002a9618e228 Yes /usr/lib64/qpid/daemon/xml.so 0x0000002a96424330 0x0000002a965b2098 Yes /usr/lib64/libxerces-c.so.28 0x0000002a96940cf0 0x0000002a96ad23c8 Yes /usr/lib64/libxqilla.so.3 (gdb) 2 process 20448 0x00000039cb508f0f in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 * 1 process 20446 0x0000002a95a23e8a in DbTxn::abort () from /usr/lib64/tls/libdb_cxx-4.2.so Thread 2 (process 20448): #0 0x00000039cb508f0f in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x00000039cad519c3 in qpid::sys::Timer::run (this=0x568690) at ../include/qpid/sys/posix/Condition.h:69 #2 0x00000039cac6008a in qpid::sys::(anonymous namespace)::runRunnable ( p=0x5686c4) at qpid/sys/posix/Thread.cpp:35 #3 0x00000039cb506317 in start_thread () from /lib64/tls/libpthread.so.0 #4 0x00000039cb0c9d83 in clone () from /lib64/tls/libc.so.6 Thread 1 (process 20446): #0 0x0000002a95a23e8a in DbTxn::abort () from /usr/lib64/tls/libdb_cxx-4.2.so #1 0x0000002a958510a5 in mrg::msgstore::TxnCtxt::abort (this=0x7fbfffda20) at TxnCtxt.cpp:131 #2 0x0000002a95851d72 in ~TxnCtxt (this=0x570500) at TxnCtxt.cpp:77 #3 0x0000002a9582fe88 in mrg::msgstore::MessageStoreImpl::init (this=Variable "this" is not available. at MessageStoreImpl.cpp:346 #4 0x0000002a95831558 in mrg::msgstore::MessageStoreImpl::init ( this=0x56c290, dir=@0x2a959d1f90, jfiles=16, jfileSizePgs=32, truncateFlag=false, wCachePageSizeKib=32, tplJfiles=8, tplJfileSizePgs=32, tplWCachePageSizeKib=128, autoJExpand=false, autoJExpandMaxFiles=0) at MessageStoreImpl.cpp:289 #5 0x0000002a9583c952 in mrg::msgstore::MessageStoreImpl::init ( this=0x56c290, options=Variable "options" is not available. ) at MessageStoreImpl.cpp:256 #6 0x0000002a95804040 in qpid::broker::StorePlugin::earlyInitialize ( this=0x2a959d1ee0, target=@0x568648) at StorePlugin.cpp:56 #7 0x00000039cacf8e55 in std::for_each<__gnu_cxx::__normal_iterator<qpid::Plugin* const*, std::vector<qpid::Plugin*, std::allocator<qpid::Plugin*> > >, boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > > (__first={_M_current = 0x557f78}, __last={_M_current = 0x557f98}, {f_ = {f_ = {__pfn = 0x19, __delta = 0}}, l_ = {a1_ = {<No data fields>}, a2_ = {t_ = 0x568648}}}) at /usr/include/boost/bind/mem_fn_template.hpp:149 #8 0x00000039cacf8f29 in qpid::(anonymous namespace)::each_plugin<boost::_bi::bind_t<void, boost::_mfi::mf1<void, qpid::Plugin, qpid::Plugin::Target&>, boost::_bi::list2<boost::arg<1>, boost::reference_wrapper<qpid::Plugin::Target> > > > (f=@0x7fbfffdf80) at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_vector.h:322 #9 0x00000039cacf8134 in qpid::Plugin::earlyInitAll (t=Variable "t" is not available. at /usr/include/boost/ref.hpp:64 #10 0x00000039cbc598be in Broker (this=0x568640, conf=@0x54d3f8) at qpid/broker/Broker.cpp:216 #11 0x000000000040e3a1 in QpiddBroker::execute (this=Variable "this" is not available. at /usr/include/boost/intrusive_ptr.hpp:62 #12 0x0000000000408d34 in main (argc=17, argv=0x7fbfffeb98) at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/memory:301 #13 0x00000039cb01c4cb in __libc_start_main () from /lib64/tls/libc.so.6 #14 0x000000000040550a in _start () #15 0x0000007fbfffeb88 in ?? () #16 0x000000000000001c in ?? () #17 0x0000000000000011 in ?? () #18 0x0000007fbfffeffe in ?? () #19 0x0000007fbffff00e in ?? () #20 0x0000007fbffff01b in ?? () #21 0x0000007fbffff021 in ?? () #22 0x0000007fbffff028 in ?? () #23 0x0000007fbffff02b in ?? () #24 0x0000007fbffff042 in ?? () #25 0x0000007fbffff046 in ?? () #26 0x0000007fbffff05b in ?? () #27 0x0000007fbffff05e in ?? () #28 0x0000007fbffff06b in ?? () #29 0x0000007fbffff06e in ?? () #30 0x0000007fbffff07f in ?? () #31 0x0000007fbffff082 in ?? () #32 0x0000007fbffff08d in ?? () #33 0x0000007fbffff0f5 in ?? () #34 0x0000007fbffff0fc in ?? () #35 0x0000000000000000 in ?? () Looking at the error messages of the broker I found no occurrence of above discussed message: 'error BDB exception occurred while initializing store:' The recorded log looks like this: 2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/acl.so 2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/replication_exchange.so 2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so 2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so 2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/replicating_listener.so 2010-08-13 15:58:58 info Loaded Module: /usr/lib64/qpid/daemon/xml.so 2010-08-13 15:58:58 info No message store configured, persistence is disabled. 2010-08-13 15:58:58 info Management enabled 2010-08-13 15:58:58 info ManagementAgent generated broker ID: 00d2b040-1ed2-45b0-a9c7-5a2b7e29c02b 2010-08-13 15:58:58 info Registered replication exchange I believe this proves that fix is not efficient enough. -> ASSIGNED Note from comment 6 above that the fix came in two parts. Only the exception handling part was included into 1.3, and thus the odds of this error occurring are unchanged. However, the way in which the exception in comment 7 unfolded was unexpected. The basic failure was in BDB; either the BDB environment or the transaction begin failed. However, this seems not to be handed in the way I was expecting. I will investigate this aspect further. I recommend pushing this bug out of 1.3 at this point. In any event, the fix for the cause of the bug is not in 1.3 either - see comment 6 above. Additional error handling included in r.4212. I found one case of an improperly handled exception while initializing BDB, this is the only way that the trace seen above can be explained (that I can see). Changed target milestone to 1.4. The fix for the cause of this is on trunk, but has not been included into 1.3 (although the improved exception handling has, because of its relatively low risk). Once MRG rebases, this fix will get into the build. This issue has been fixed Verified on RHEL5.6 and RHEL6.1, architectures: i386, x86_64 bz456272.py executed in loop 1800 times Test duration: over 100 hours Result: no broker crash occurred packages installed: python-qpid-0.10-1.el6.noarch python-qpid-qmf-0.10-6.el6.x86_64 qpid-cpp-client-0.10-3.el6.x86_64 qpid-cpp-client-devel-0.10-3.el6.x86_64 qpid-cpp-client-devel-docs-0.10-3.el6.noarch qpid-cpp-client-rdma-0.10-3.el6.x86_64 qpid-cpp-client-ssl-0.10-3.el6.x86_64 qpid-cpp-server-0.10-3.el6.x86_64 qpid-cpp-server-cluster-0.10-3.el6.x86_64 qpid-cpp-server-devel-0.10-3.el6.x86_64 qpid-cpp-server-rdma-0.10-3.el6.x86_64 qpid-cpp-server-ssl-0.10-3.el6.x86_64 qpid-cpp-server-store-0.10-3.el6.x86_64 qpid-cpp-server-xml-0.10-3.el6.x86_64 qpid-java-client-0.10-4.el6.noarch qpid-java-common-0.10-4.el6.noarch qpid-java-example-0.10-4.el6.noarch qpid-java-jca-0.10-4.el6.noarch qpid-qmf-0.10-6.el6.x86_64 qpid-qmf-devel-0.10-6.el6.x86_64 qpid-tests-0.10-1.el6.noarch qpid-tools-0.10-3.el6.noarch rh-qpid-cpp-tests-0.10-3.el6.x86_64 rh-tests-distribution-MRG-Messaging-qpid_broker_jfail_bz456272-1.5-8.noarch rh-tests-distribution-MRG-Messaging-qpid_common-1.6-56.noarch ruby-qpid-0.7.946106-2.el6.x86_64 ruby-qpid-qmf-0.10-6.el6.x86_64 -> VERIFIED An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2011-0890.html |