Bug 849654
| Field | Value |
|---|---|
| Summary | qpidd crashes around qpid::amqp_0_10::_encode under qpid_ptest_stress test (Unexpected exception: Owner died (../include/qpid/sys/posix/Mutex.h:116)) |
| Product | Red Hat Enterprise MRG |
| Component | qpid-cpp |
| Version | Development |
| Status | CLOSED CURRENTRELEASE |
| Reporter | Frantisek Reznicek <freznice> |
| Assignee | Andrew Stitcher <astitcher> |
| QA Contact | Frantisek Reznicek <freznice> |
| CC | esammons, gsim, iboverma, jross, lzhaldyb, mcressma |
| Severity | unspecified |
| Priority | unspecified |
| Target Milestone | 2.2 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | qpid-cpp-0.14-22 |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2012-12-07 17:41:22 UTC |
Description
Frantisek Reznicek
2012-08-20 13:32:25 UTC
Test raw log stripe:

```
[12:58:37] Periodic report
[12:58:37] ==========================================[err_cnt:0; cnts: 214]
[12:58:37] pids:18273 27527 14506 14509 14507 28702 28703 28705 28706 28707 28708 28709 28710 28711 28712 28713 28741 13708 13714 13715 28742 10186 10300 10313 10370 10385 10431 10442 10476 28747 14110 14114 14219 14220 14221 14222 28752 28761 14375 14377 28765 14232 14271 14288 14290 14291 14295 14296 14297 14306 14333 14334 14335 14336 14337 14338 14339 14340 14341 14342 14343 14344 14345 14346 14347 14348 14349 14350 14356 14357 14362 14382 14389 14395 14415 14436 14437 14438 14444 14446 14448 14451 14452 14453 14454 14457 14459 14460 14462 14465 28772 28778 14488 28804 14514 14523 0 14521 0 14522 14515 14517 14519 28809 14508 28826 0 14467 0 14481 28827 28828 0 12360 0 12378 27529 , detailed thread stats: thread_1spout1drain_cpp.log:0 thread_1spout1drain_cpppython.log:0 thread_1spout1drain_python.log:0 thread_client_test.log:0 thread_exchange_cpp.log:0 thread_exchange_python.log:0 thread_perftest.log:0
[12:58:37]
[12:58:37] . . . . . . . .
functions.sh: line 380: 28703 Segmentation fault (core dumped) qpidd ${qpidd_params} -p ${l_port[${i}]} --data-dir=${PWD}/dd${l_port[${i}]} --log-to-file=qpidd.p${l_port[${i}]}.log --log-enable info+ >&/dev/null
```

Broker log file stripe:

```
...
2012-08-19 12:58:38 info Queue "thread_queue_py_886f4202f9e4fea2af611f1642f84a08_8": Policy created: type=reject; maxCount=0; maxSize=104857600
2012-08-19 12:58:38 info Queue "thread_queue_py_886f4202f9e4fea2af611f1642f84a08_8": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=83886080, flowResumeSize=73400320
2012-08-19 12:58:38 info Queue "MyQueue": Policy created: type=reject; maxCount=0; maxSize=104857600
2012-08-19 12:58:38 info Queue "MyQueue": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=83886080, flowResumeSize=73400320
2012-08-19 12:58:38 info Queue "qmfc-bdf4ff5e-e861-4347-99f9-4e3936633b31": Policy created: type=reject; maxCount=0; maxSize=104857600
2012-08-19 12:58:38 info Queue "qmfc-bdf4ff5e-e861-4347-99f9-4e3936633b31": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=83886080, flowResumeSize=73400320
2012-08-19 12:58:38 error Unexpected exception: Owner died (../include/qpid/sys/posix/Mutex.h:116)
2012-08-19 12:58:38 error Connection 127.0.0.1:5672-127.0.0.1:46149 closed by error: Owner died (../include/qpid/sys/posix/Mutex.h:116)(501)
```

The error message may not be directly related to the core dump. The exception is caught and handled, it seems. The backtrace is in management code. I think this would be better handled by someone more familiar with the backported patches in this build.

I agree with Gordon's earlier comment - the reported exception doesn't seem directly related to the original stack trace, which is consistent with the idea of memory corruption. The valgrind logs show a very small amount of memory leaked, which is probably not related, although it is perhaps worth investigating on its own; however, the valgrind options weren't turned up high enough to report the individual leaks. I think the best way to isolate this issue may be to bisect the dev history of the code and see exactly which change introduced the bug.

I've replicated this problem on the same test machine a few times, but then the problem steadfastly refused to show (with the identical software). This is going to make it rather tricky to attempt to bisect this problem. I wonder if there is any way to get a more cut-down test case.
*** Bug 851087 has been marked as a duplicate of this bug. ***

I've now run a bisection on the entire 0.14 release branch and found that this bug (or possibly this and one very similar to it) is present in the entire branch. As far as I know this means that the 2.1 release also suffered from the same problem. In each case the qpidd process core dumped within a very few minutes of the beginning of the test. As a comparison I also ran the test on the tip of our 0.12 release branch and the test did not fail within about an hour. So I think it is fair to conclude that this bug (or these bugs) was introduced between 0.12 and 0.14, roughly speaking.

There seemed to be a number of common elements to the stack traces of the crashed brokers: most threads were waiting to do something, so the broker wasn't madly busy with lots of different connections. Another aspect common to many of the core dumps is a thread waiting on a mutex with this stack trace (or something like it):

```
#0  0x00000036fe20d7a4 in __lll_robust_lock_wait () from /lib64/libpthread.so.0
#1  0x00000036fe208e54 in _L_robust_lock_392 () from /lib64/libpthread.so.0
#2  0x00000036fe20898e in __pthread_mutex_lock_full (mutex=0x2aaab8000188) at pthread_mutex_lock.c:221
#3  0x00002ba760c26530 in qpid::sys::Mutex::lock (this=0x2aaab8000188) at ../../git-src/qpid/cpp/include/qpid/sys/posix/Mutex.h:116
#4  0x00002ba760cae660 in ScopedLock (this=0x2aaab8000088) at ../../git-src/qpid/cpp/include/qpid/sys/Mutex.h:33
#5  qpid::broker::Message::getApplicationHeaders (this=0x2aaab8000088) at ../../git-src/qpid/cpp/src/qpid/broker/Message.cpp:99
#6  0x00002ba760cb5abc in qpid::broker::MessageMap::getKey (this=0x0, message=...) at ../../git-src/qpid/cpp/src/qpid/broker/MessageMap.cpp:32
#7  0x00002ba760cb6bfd in qpid::broker::MessageMap::erase (this=0x0, i=<value optimized out>) at ../../git-src/qpid/cpp/src/qpid/broker/MessageMap.cpp:160
#8  0x00002ba760cb6d3d in qpid::broker::MessageMap::removeIf (this=0x2aaaac9d89a0, p=...) at ../../git-src/qpid/cpp/src/qpid/broker/MessageMap.cpp:153
#9  0x00002ba760c9882b in qpid::broker::LegacyLVQ::removeIf (this=0x2aaaac9d89a0, p=<value optimized out>) at ../../git-src/qpid/cpp/src/qpid/broker/LegacyLVQ.cpp:97
#10 0x00002ba760ccf860 in qpid::broker::Queue::purge (this=0x2aaaac94c730, purge_request=<value optimized out>, dest=..., filter=<value optimized out>) at ../../git-src/qpid/cpp/src/qpid/broker/Queue.cpp:629
#11 0x00002ba760d1f726 in qpid::broker::SessionAdapter::QueueHandlerImpl::purge (this=<value optimized out>, queue="TEST.RECEIVER.LV.0") at ../../git-src/qpid/cpp/src/qpid/broker/SessionAdapter.cpp:342
#12 0x00002ba7611b5e12 in ~Invoker (this=0x2ba760d1f640, body=<value optimized out>) at ./qpid/framing/ServerInvoker.h:218
#13 qpid::framing::AMQP_ServerOperations::Invoker::visit (this=0x2ba760d1f640, body=<value optimized out>) at qpid/framing/ServerInvoker.cpp:344
#14 0x0000000042f5dc30 in ?? ()
#15 0x0000000000000000 in ?? ()
```
If you inspect frame 8 you will see that we are in the middle of iterating through messages (in line 153):

```
148
149 void MessageMap::removeIf(Predicate p)
150 {
151     for (Ordering::iterator i = messages.begin(); i != messages.end(); i++) {
152         if (p(i->second)) {
153             erase(i);
154         }
155     }
156 }
157
```

However the messages collection has 0 elements:

```
(gdb) p *this
$2 = {<qpid::broker::Messages> = {_vptr.Messages = 0x2ba761007f10}, key = "qpid.LVQ_key",
     index = std::map with 0 elements, messages = std::map with 0 elements}
```

The only way I can think of to make this consistent is that after this thread started iterating, some other thread went in and removed all the elements from index and messages. This is not present in all of the core dumps but certainly is in many of them - I suspect it's a big clue.

Some further results on the suggestive stack trace: out of 7 core dumps, 6 include this stack trace.

The MessageMap::removeIf() implementation for the 0.14 release branch does indeed look suspect (http://mrg1.lab.bos.redhat.com/cgit/qpid.git/tree/qpid/cpp/src/qpid/broker/MessageMap.cpp?h=0.14-mrg-preview-2). The iterator is incremented even when an element is erased. Any purging of an LVQ would seem to have the potential to cause memory corruption. (Fixing the loop to post-increment the iterator as it is passed to erase would avoid this problem at least; that is already the case on 0.18, and trunk no longer has that method at all.)

Simple reproducer:

```
for k in `seq 1 100`; do
    ./src/tests/qpid-send -m 1 -a "lvq/key-$k; {create: always, node:{x-declare:{arguments:{qpid.last_value_queue_key:qpid.subject}}}}"
done
```

then purge the queue, e.g. via qpid-tool.

Simple fix:

```
diff -up ./src/qpid/broker/MessageMap.cpp.orig ./src/qpid/broker/MessageMap.cpp
--- ./src/qpid/broker/MessageMap.cpp.orig	2012-09-04 16:28:16.000000000 -0400
+++ ./src/qpid/broker/MessageMap.cpp	2012-09-04 16:29:39.000000000 -0400
@@ -148,9 +148,11 @@ void MessageMap::foreach(Functor f)
 
 void MessageMap::removeIf(Predicate p)
 {
-    for (Ordering::iterator i = messages.begin(); i != messages.end(); i++) {
+    for (Ordering::iterator i = messages.begin(); i != messages.end();) {
         if (p(i->second)) {
-            erase(i);
+            erase(i++);
+        } else {
+            ++i;
         }
     }
 }
```

Note: This was tested and fixed against a pristine qpid 0.14 release, as my attempts to clone the mrg repo did not succeed this evening. However, I believe it is entirely applicable to the 0.14 preview branch.

Created attachment 609777 [details]
Suggested fix
Attaching patch for simpler application.
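For illustration, here is a minimal standalone sketch (plain C++, not qpid code) of the pattern the patch corrects; `Ordering`, `Predicate`, `removeIfBuggy`, `removeIfFixed`, and `isEven` are simplified, hypothetical stand-ins for the broker's types. For `std::map`, erasing the element an iterator points to invalidates that iterator, so the original loop's subsequent `i++` (and any later `i->second`) is undefined behaviour and can end up dereferencing freed memory - consistent with the crash surfacing later inside `Message::getApplicationHeaders`.

```cpp
// Simplified stand-in for MessageMap::removeIf (hypothetical types, not qpid code).
#include <cstdio>
#include <map>
#include <string>

typedef std::map<std::string, int> Ordering;   // stand-in for the broker's ordered message map
typedef bool (*Predicate)(int);

// Shape of the 0.14 loop: erase() invalidates 'i', yet the for-loop still
// executes 'i++' on that invalidated iterator -- undefined behaviour.
// Shown for comparison only; do not call.
void removeIfBuggy(Ordering& messages, Predicate p) {
    for (Ordering::iterator i = messages.begin(); i != messages.end(); i++) {
        if (p(i->second)) {
            messages.erase(i);   // 'i' is now invalid, but the loop increments it anyway
        }
    }
}

// Shape of the fix: post-increment while 'i' is still valid, so erase() receives
// the old position and the loop keeps a live iterator.
void removeIfFixed(Ordering& messages, Predicate p) {
    for (Ordering::iterator i = messages.begin(); i != messages.end();) {
        if (p(i->second)) {
            messages.erase(i++);   // advance first, then erase the previous position
        } else {
            ++i;
        }
    }
}

static bool isEven(int v) { return v % 2 == 0; }

int main() {
    Ordering m;
    m["a"] = 1; m["b"] = 2; m["c"] = 3; m["d"] = 4;
    removeIfFixed(m, isEven);                  // removes "b" and "d"
    for (Ordering::const_iterator i = m.begin(); i != m.end(); ++i)
        std::printf("%s=%d\n", i->first.c_str(), i->second);   // prints a=1 then c=3
    return 0;
}
```

On C++11 and later the same loop is usually written `i = messages.erase(i);`; the post-increment form used in the patch has the advantage of working with the pre-C++11 toolchains the 0.14 branch presumably targets.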
Retesting is ongoing, final results by 2012-09-14.

The issue has been fixed and retested on multiple machines, RHEL 5.8 / 6.3, i[36]86 / x86_64 (focus on RHEL5.8x), on the following packages:

el5:
```
python-qpid-0.14-11.el5.noarch
python-qpid-qmf-0.14-14.el5.x86_64
python-saslwrapper-0.10-4.el5.x86_64
qpid-cpp-client-0.14-22.el5.x86_64
qpid-cpp-client-devel-0.14-22.el5.x86_64
qpid-cpp-client-devel-docs-0.14-22.el5.x86_64
qpid-cpp-client-rdma-0.14-22.el5.x86_64
qpid-cpp-client-ssl-0.14-22.el5.x86_64
qpid-cpp-server-0.14-22.el5.x86_64
qpid-cpp-server-cluster-0.14-22.el5.x86_64
qpid-cpp-server-devel-0.14-22.el5.x86_64
qpid-cpp-server-rdma-0.14-22.el5.x86_64
qpid-cpp-server-ssl-0.14-22.el5.x86_64
qpid-cpp-server-store-0.14-22.el5.x86_64
qpid-cpp-server-xml-0.14-22.el5.x86_64
qpid-java-client-0.18-2.el5.noarch
qpid-java-common-0.18-2.el5.noarch
qpid-java-example-0.18-2.el5.noarch
qpid-jca-0.18-2.el5.noarch
qpid-jca-xarecovery-0.18-2.el5.noarch
qpid-jca-zip-0.18-2.el5.noarch
qpid-qmf-0.14-14.el5.x86_64
qpid-qmf-devel-0.14-14.el5.x86_64
qpid-tests-0.14-1.el5.noarch
qpid-tools-0.14-6.el5.noarch
rh-qpid-cpp-tests-0.14-22.el5.x86_64
ruby-qpid-qmf-0.14-14.el5.x86_64
ruby-saslwrapper-0.10-4.el5.x86_64
saslwrapper-0.10-4.el5.x86_64
saslwrapper-devel-0.10-4.el5.x86_64
sesame-1.0-4.el5.x86_64
```

el6:
```
python-qpid-0.14-11.el6_3.noarch
python-qpid-qmf-0.14-14.el6_3.x86_64
python-saslwrapper-0.14-1.el6.x86_64
qpid-cpp-client-0.14-22.el6_3.x86_64
qpid-cpp-client-devel-0.14-22.el6_3.x86_64
qpid-cpp-client-devel-docs-0.14-22.el6_3.noarch
qpid-cpp-client-rdma-0.14-22.el6_3.x86_64
qpid-cpp-client-ssl-0.14-22.el6_3.x86_64
qpid-cpp-server-0.14-22.el6_3.x86_64
qpid-cpp-server-cluster-0.14-22.el6_3.x86_64
qpid-cpp-server-devel-0.14-22.el6_3.x86_64
qpid-cpp-server-rdma-0.14-22.el6_3.x86_64
qpid-cpp-server-ssl-0.14-22.el6_3.x86_64
qpid-cpp-server-store-0.14-22.el6_3.x86_64
qpid-cpp-server-xml-0.14-22.el6_3.x86_64
qpid-java-client-0.18-2.el6.noarch
qpid-java-common-0.18-2.el6.noarch
qpid-java-example-0.18-2.el6.noarch
qpid-jca-0.18-2.el6.noarch
qpid-jca-xarecovery-0.18-2.el6.noarch
qpid-jca-zip-0.18-2.el6.noarch
qpid-qmf-0.14-14.el6_3.x86_64
qpid-qmf-devel-0.14-14.el6_3.x86_64
qpid-tests-0.14-1.el6_2.noarch
qpid-tools-0.14-6.el6_3.noarch
rh-qpid-cpp-tests-0.14-22.el6_3.x86_64
ruby-qpid-qmf-0.14-14.el6_3.x86_64
ruby-saslwrapper-0.14-1.el6.x86_64
saslwrapper-0.14-1.el6.x86_64
saslwrapper-devel-0.14-1.el6.x86_64
sesame-1.0-6.el6.x86_64
```

-> VERIFIED