Bug 849654

Summary: qpidd crashes around qpid::amqp_0_10::_encode under qpid_ptest_stress test (Unexpected exception: Owner died (../include/qpid/sys/posix/Mutex.h:116))
Product: Red Hat Enterprise MRG Reporter: Frantisek Reznicek <freznice>
Component: qpid-cppAssignee: Andrew Stitcher <astitcher>
Status: CLOSED CURRENTRELEASE QA Contact: Frantisek Reznicek <freznice>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: DevelopmentCC: esammons, gsim, iboverma, jross, lzhaldyb, mcressma
Target Milestone: 2.2   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qpid-cpp-0.14-22 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-12-07 17:41:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Suggested fix none

Description Frantisek Reznicek 2012-08-20 13:32:25 UTC
Description of problem:

qpidd crashes around qpid::amqp_0_10::_encode under qpid_ptest_stress test (Unexpected exception: Owner died (../include/qpid/sys/posix/Mutex.h:116))

qpid_ptest_stress runs repetively multiple clients (low level API / high level API, QMF consoles) in parallel.

Running this test [1] resulted in qpidd crash:

  [root@ibm-x3550m3-01 qpid_ptest_stress]# ll -h ./core.28703
  -rw------- 1 root root 334M Aug 19 12:58 ./core.28703
  [root@ibm-x3550m3-01 qpid_ptest_stress]# gdb  `which qpidd` ./core.28703
  ...
  warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff29b0c000
  Core was generated by `qpidd --auth=no -p 5672 --data-dir=/mnt/tests/distribution/MRG/Messaging/qpid_p'.
  Program terminated with signal 11, Segmentation fault.
  #0  0x00000038736715ed in _int_malloc () from /lib64/libc.so.6
  (gdb) info threads
    10 Thread 0x2ba3e74b6040 (LWP 28703)  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
    9 Thread 28705  0x0000003873e0b1c0 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    8 Thread 28706  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
    7 Thread 28707  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
    6 Thread 28708  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
    5 Thread 28709  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
    4 Thread 28710  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
    3 Thread 28712  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
    2 Thread 28713  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  * 1 Thread 0x4520f940 (LWP 28711)  0x00000038736715ed in _int_malloc () from /lib64/libc.so.6
  (gdb) t a a bt
  
  Thread 10 (Thread 0x2ba3e74b6040 (LWP 28703)):
  #0  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  #1  0x0000003874736431 in qpid::sys::Poller::wait (this=0x1a96dc80, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:568
  #2  0x0000003874736ea7 in qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:520
  #3  0x0000003876739293 in qpid::broker::Broker::run (this=<value optimized out>) at qpid/broker/Broker.cpp:405
  #4  0x000000000040741c in QpiddBroker::execute (this=<value optimized out>, options=0x1a966190) at posix/QpiddBroker.cpp:195
  #5  0x00000000004058d7 in run_broker (argc=8, argv=0x7fff29aeaef8, hidden=<value optimized out>) at qpidd.cpp:83
  #6  0x000000387361d994 in __libc_start_main () from /lib64/libc.so.6
  #7  0x0000000000405329 in _start ()
  
  Thread 9 (Thread 28705):
  #0  0x0000003873e0b1c0 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  #1  0x0000003874815bea in qpid::sys::Timer::run (this=0xffffffffffffff92) at ../include/qpid/sys/posix/Condition.h:69
  #2  0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x1a9d94e4) at qpid/sys/posix/Thread.cpp:35
  #3  0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #4  0x00000038736d33ed in clone () from /lib64/libc.so.6
  
  Thread 8 (Thread 28706):
  #0  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  #1  0x0000003874736431 in qpid::sys::Poller::wait (this=0x1a96dc80, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:568
  #2  0x0000003874736ea7 in qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:520
  #3  0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
  #4  0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000038736d33ed in clone () from /lib64/libc.so.6
  
  Thread 7 (Thread 28707):
  #0  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  #1  0x0000003874736431 in qpid::sys::Poller::wait (this=0x1a96dc80, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:568
  #2  0x0000003874736ea7 in qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:520
  #3  0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
  #4  0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000038736d33ed in clone () from /lib64/libc.so.6
  
  Thread 6 (Thread 28708):
  #0  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  #1  0x0000003874736431 in qpid::sys::Poller::wait (this=0x1a96dc80, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:568
  #2  0x0000003874736ea7 in qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:520
  #3  0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
  #4  0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000038736d33ed in clone () from /lib64/libc.so.6
  
  Thread 5 (Thread 28709):
  #0  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  #1  0x0000003874736431 in qpid::sys::Poller::wait (this=0x1a96dc80, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:568
  #2  0x0000003874736ea7 in qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:520
  #3  0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
  #4  0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000038736d33ed in clone () from /lib64/libc.so.6
  
  Thread 4 (Thread 28710):
  #0  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  #1  0x0000003874736431 in qpid::sys::Poller::wait (this=0x1a96dc80, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:568
  #2  0x0000003874736ea7 in qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:520
  #3  0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
  #4  0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000038736d33ed in clone () from /lib64/libc.so.6
  
  Thread 3 (Thread 28712):
  #0  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  #1  0x0000003874736431 in qpid::sys::Poller::wait (this=0x1a96dc80, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:568
  #2  0x0000003874736ea7 in qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:520
  #3  0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
  #4  0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000038736d33ed in clone () from /lib64/libc.so.6
  
  Thread 2 (Thread 28713):
  #0  0x00000038736d37d8 in epoll_wait () from /lib64/libc.so.6
  #1  0x0000003874736431 in qpid::sys::Poller::wait (this=0x1a96dc80, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:568
  #2  0x0000003874736ea7 in qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:520
  #3  0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x6) at qpid/sys/posix/Thread.cpp:35
  #4  0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000038736d33ed in clone () from /lib64/libc.so.6
  
  Thread 1 (Thread 0x4520f940 (LWP 28711)):
  
  #0  0x00000038736715ed in _int_malloc () from /lib64/libc.so.6
  #1  0x00000038736738de in malloc () from /lib64/libc.so.6
  #2  0x0000003877abd25d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
  #3  0x0000003877a9b861 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /usr/lib64/libstdc++.so.6
  #4  0x0000003877a9d111 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_mutate(unsigned long, unsigned long, unsigned long) () from /usr/lib64/libstdc++.so.6
  #5  0x0000003877a9d28c in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long) () from /usr/lib64/libstdc++.so.6
  #6  0x000000387481e04a in getData (value=<value optimized out>, data="", f=<value optimized out>) at qpid/amqp_0_10/Codecs.cpp:269
  #7  qpid::amqp_0_10::_encode<qpid::framing::FieldTable, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, qpid::types::Variant, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, qpid::types::Variant> > >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::shared_ptr<qpid::framing::FieldValue> > (*)(std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, qpid::types::Variant> const&)> (value=<value optimized out>, data="", f=<value optimized out>)
      at qpid/amqp_0_10/Codecs.cpp:289
  #8  0x00000038747fdf30 in qpid::management::Buffer::putMap (this=0x451fb570, i=std::map with 0 elements) at qpid/management/Buffer.cpp:75
  #9  0x00000038767127f9 in qmf::org::apache::qpid::broker::EventBind::writeSchema (schema="")
      at qmf/org/apache/qpid/broker/EventBind.cpp:90
  #10 0x000000387685beb5 in qpid::management::ManagementAgent::SchemaClass::appendSchema (this=0x1a9ec688, buf=...)
      at qpid/management/ManagementAgent.cpp:1590
  #11 0x00000038768670d1 in qpid::management::ManagementAgent::handleSchemaRequestLH (this=0x2aaaaaaab010, inBuffer=<value optimized out>,
      rte="amq.direct", rtk="qmfc-bdf4ff5e-e861-4347-99f9-4e3936633b31", sequence=10) at qpid/management/ManagementAgent.cpp:1618
  #12 0x0000003876886521 in qpid::management::ManagementAgent::dispatchAgentCommandLH (this=0x2aaaaaaab010, msg=..., viaLocal=false)
      at qpid/management/ManagementAgent.cpp:2296
  #13 0x0000003876886df6 in qpid::management::ManagementAgent::dispatchCommand (this=0x2aaaaaaab010, deliverable=<value optimized out>,
      routingKey="broker", topic=true, qmfVersion=1) at qpid/management/ManagementAgent.cpp:1204
  #14 0x0000003876890c2c in qpid::broker::ManagementTopicExchange::route (this=0x1a9f63e0, msg=..., routingKey="broker", args=0x1d3f0690)
      at qpid/management/ManagementTopicExchange.cpp:50
  #15 0x000000387680e7ea in qpid::broker::SemanticState::route (this=<value optimized out>, msg=..., strategy=...)
      at qpid/broker/SemanticState.cpp:495
  #16 0x000000387680f21d in qpid::broker::SemanticState::handle (this=0x2aaaae4f5b08, msg=...) at qpid/broker/SemanticState.cpp:449
  #17 0x0000003876839228 in qpid::broker::SessionState::handleContent (this=0x2aaaae4f5930, frame=..., id=<value optimized out>)
      at qpid/broker/SessionState.cpp:266
  #18 0x00000038768399b0 in qpid::broker::SessionState::handleIn (this=0x2aaaae4f5930, frame=...) at qpid/broker/SessionState.cpp:362
  #19 0x00000038747d5cd5 in qpid::amqp_0_10::SessionHandler::handleIn (this=0x2aaaae018620, f=...) at qpid/amqp_0_10/SessionHandler.cpp:93
  #20 0x0000003876761ca1 in operator() (this=0x2aaaae4eff38, frame=...) at qpid/framing/Handler.h:42
  #21 qpid::broker::ConnectionHandler::handle (this=0x2aaaae4eff38, frame=...) at qpid/broker/ConnectionHandler.cpp:87
  #22 0x0000003876756d48 in qpid::broker::Connection::received (this=0x2aaaae4efd50, frame=...) at qpid/broker/Connection.cpp:174
  #23 0x0000003876726824 in qpid::amqp_0_10::Connection::decode (this=0x2aaaaca75630, buffer=<value optimized out>,
      size=<value optimized out>) at qpid/amqp_0_10/Connection.cpp:58
  #24 0x0000003874808cbd in qpid::sys::AsynchIOHandler::readbuff (this=0x1b5a3910, buff=0x1cc20b20) at qpid/sys/AsynchIOHandler.cpp:164
  #25 0x000000387472bd4a in boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*, std::allocator<boost::function_base> >::operator() (this=0x0, a0=..., a1=0x4340c9e0) at /usr/include/boost/function/function_template.hpp:576
  #26 0x0000003874729af0 in qpid::sys::posix::AsynchIO::readable (this=0x1c0e1090, h=...) at qpid/sys/posix/AsynchIO.cpp:446
  #27 0x000000387480e82a in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() (
      this=0x0, a0=...) at /usr/include/boost/function/function_template.hpp:576
  #28 0x000000387480df2f in qpid::sys::DispatchHandle::processEvent (this=0x1c0e1098, type=READABLE) at qpid/sys/DispatchHandle.cpp:280
  #29 0x0000003874736ed4 in process (this=0x1a96dc80) at qpid/sys/Poller.h:131
  #30 qpid::sys::Poller::run (this=0x1a96dc80) at qpid/sys/epoll/EpollPoller.cpp:524
  #31 0x000000387472dfaa in qpid::sys::(anonymous namespace)::runRunnable (p=0x38739529e0) at qpid/sys/posix/Thread.cpp:35
  #32 0x0000003873e0677d in start_thread () from /lib64/libpthread.so.0
  #33 0x00000038736d33ed in clone () from /lib64/libc.so.6
  (gdb) quit

Crashes seen twice on RHEL 5.8 x86_64.


Version-Release number of selected component (if applicable):
  [root@ibm-x3550m3-01 qpid_ptest_stress]# rpm -qa | grep qpid | grep -v MRG-distributio | sort
  python-qpid-0.14-11.el5
  python-qpid-qmf-0.14-14.el5
  qpid-cpp-client-0.14-21.el5
  qpid-cpp-client-devel-0.14-21.el5
  qpid-cpp-client-devel-docs-0.14-21.el5
  qpid-cpp-client-rdma-0.14-21.el5
  qpid-cpp-client-ssl-0.14-21.el5
  qpid-cpp-mrg-debuginfo-0.14-21.el5
  qpid-cpp-server-0.14-21.el5
  qpid-cpp-server-cluster-0.14-21.el5
  qpid-cpp-server-devel-0.14-21.el5
  qpid-cpp-server-rdma-0.14-21.el5
  qpid-cpp-server-ssl-0.14-21.el5
  qpid-cpp-server-store-0.14-21.el5
  qpid-cpp-server-xml-0.14-21.el5
  qpid-java-client-0.18-1.el5
  qpid-java-common-0.18-1.el5
  qpid-java-example-0.18-1.el5
  qpid-qmf-0.14-14.el5
  qpid-qmf-debuginfo-0.14-14.el5
  qpid-qmf-devel-0.14-14.el5
  qpid-tests-0.14-1.el5
  qpid-tools-0.14-5.el5
  rh-qpid-cpp-tests-0.14-21.el5
  ruby-qpid-qmf-0.14-14.el5


How reproducible:
10%

Steps to Reproduce:
1. run the test
2. wait for crash

Actual results:
qpidd crashes.

Expected results:
qpidd should not crash.

Additional info:

Comment 2 Frantisek Reznicek 2012-08-20 13:34:35 UTC
Test raw log stripe:

[12:58:37] Periodic report
[12:58:37] ==========================================[err_cnt:0; cnts: 214]
[12:58:37] pids:18273 27527 14506 14509 14507 28702 28703 28705 28706 28707 28708 28709 28710 28711 28712 28713 28741 13708 13714 13715 287
42 10186 10300 10313 10370 10385 10431 10442 10476 28747 14110 14114 14219 14220 14221 14222 28752 28761 14375 14377 28765 14232 14271 1428
8 14290 14291 14295 14296 14297 14306 14333 14334 14335 14336 14337 14338 14339 14340 14341 14342 14343 14344 14345 14346 14347 14348 14349
 14350 14356 14357 14362 14382 14389 14395 14415 14436 14437 14438 14444 14446 14448 14451 14452 14453 14454 14457 14459 14460 14462 14465
28772 28778 14488 28804 14514 14523 0 14521 0 14522 14515 14517 14519 28809 14508 28826 0 14467 0 14481 28827 28828 0 12360 0 12378 27529 ,
 detailed thread stats:
  thread_1spout1drain_cpp.log:0  thread_1spout1drain_cpppython.log:0  thread_1spout1drain_python.log:0  thread_client_test.log:0  thread_ex
change_cpp.log:0  thread_exchange_python.log:0  thread_perftest.log:0[12:58:37]
[12:58:37] . . . . . . . .functions.sh: line 380: 28703 Segmentation fault      (core dumped) qpidd ${qpidd_params} -p ${l_port[${i}]} --da
ta-dir=${PWD}/dd${l_port[${i}]} --log-to-file=qpidd.p${l_port[${i}]}.log --log-enable info+ >&/dev/null

Broker log file stripe:
...
2012-08-19 12:58:38 info Queue "thread_queue_py_886f4202f9e4fea2af611f1642f84a08_8": Policy created: type=reject; maxCount=0; maxSize=10485
7600
2012-08-19 12:58:38 info Queue "thread_queue_py_886f4202f9e4fea2af611f1642f84a08_8": Flow limit created: flowStopCount=0, flowResumeCount=0
, flowStopSize=83886080, flowResumeSize=73400320
2012-08-19 12:58:38 info Queue "MyQueue": Policy created: type=reject; maxCount=0; maxSize=104857600
2012-08-19 12:58:38 info Queue "MyQueue": Flow limit created: flowStopCount=0, flowResumeCount=0, flowStopSize=83886080, flowResumeSize=734
00320
2012-08-19 12:58:38 info Queue "qmfc-bdf4ff5e-e861-4347-99f9-4e3936633b31": Policy created: type=reject; maxCount=0; maxSize=104857600
2012-08-19 12:58:38 info Queue "qmfc-bdf4ff5e-e861-4347-99f9-4e3936633b31": Flow limit created: flowStopCount=0, flowResumeCount=0, flowSto
pSize=83886080, flowResumeSize=73400320
2012-08-19 12:58:38 error Unexpected exception: Owner died (../include/qpid/sys/posix/Mutex.h:116)
2012-08-19 12:58:38 error Connection 127.0.0.1:5672-127.0.0.1:46149 closed by error: Owner died (../include/qpid/sys/posix/Mutex.h:116)(501
)

Comment 3 Gordon Sim 2012-08-20 14:50:48 UTC
The error message may not be directly related to the core dump. The exception is caught and handled it seems. The backtrace is in management code. I think this would be better handled by someone more familiar with the backported patches in this build.

Comment 10 Andrew Stitcher 2012-08-21 14:29:07 UTC
I agree with Gordon's earlier comment - the reported exception doesn't seem directly related to the original stack trace which is consistent with the idea of memory corruption.

The valgrind logs show a very small amount of memory leaked which is probably not related, although it is perhaps worth investigating on its own, however the valgrind options weren't turned high enough to report the individual leaks.

I think the best way to isolate this issue may be to bisect the dev history of the code and see exactly which change introduced the bug.

Comment 11 Andrew Stitcher 2012-08-21 22:41:13 UTC
I've replicated this problem on the same test machine a few times, but then the problem steadfastly refused to show (with the identical software). This is going to make it rather tricky to attempt to bisect this problem.

I wonder if there is any way to get a more cut down test case.

Comment 12 Justin Ross 2012-08-27 14:13:47 UTC
*** Bug 851087 has been marked as a duplicate of this bug. ***

Comment 13 Andrew Stitcher 2012-08-27 21:55:12 UTC
I've now run a bisection on the entire 0.14 release branch and found that this bug (or possibly this and one very similar to it) are present in the entire branch. As far as I know this means that the 2.1 release also suffered the same problem.

In each case the qpidd process core dumped within a very few minutes of the beginning of the test.

As a comparison I also ran the test on the tip of our 0.12 release branch and the test did not fail within about an hour.

So I think it is fair to conclude that this is a bug(s) that were introduced between 0.12 and 0.14 roughly speaking.

There seemed a number of common elements to the stack traces of the crashed brokers:

Most threads were waiting to do something, so the broker wasn't madly busy with lots of different connections.

Another aspect common to many of the core dumps is that in many there is a thread waiting on a mutex with this stack trace (or something like it):

#0  0x00000036fe20d7a4 in __lll_robust_lock_wait () from /lib64/libpthread.so.0
#1  0x00000036fe208e54 in _L_robust_lock_392 () from /lib64/libpthread.so.0
#2  0x00000036fe20898e in __pthread_mutex_lock_full (mutex=0x2aaab8000188) at pthread_mutex_lock.c:221
#3  0x00002ba760c26530 in qpid::sys::Mutex::lock (this=0x2aaab8000188)
    at ../../git-src/qpid/cpp/include/qpid/sys/posix/Mutex.h:116
#4  0x00002ba760cae660 in ScopedLock (this=0x2aaab8000088)
    at ../../git-src/qpid/cpp/include/qpid/sys/Mutex.h:33
#5  qpid::broker::Message::getApplicationHeaders (this=0x2aaab8000088)
    at ../../git-src/qpid/cpp/src/qpid/broker/Message.cpp:99
#6  0x00002ba760cb5abc in qpid::broker::MessageMap::getKey (this=0x0, message=...)
    at ../../git-src/qpid/cpp/src/qpid/broker/MessageMap.cpp:32
#7  0x00002ba760cb6bfd in qpid::broker::MessageMap::erase (this=0x0, i=<value optimized out>)
    at ../../git-src/qpid/cpp/src/qpid/broker/MessageMap.cpp:160
#8  0x00002ba760cb6d3d in qpid::broker::MessageMap::removeIf (this=0x2aaaac9d89a0, p=...)
    at ../../git-src/qpid/cpp/src/qpid/broker/MessageMap.cpp:153
#9  0x00002ba760c9882b in qpid::broker::LegacyLVQ::removeIf (this=0x2aaaac9d89a0, p=<value optimized out>)
    at ../../git-src/qpid/cpp/src/qpid/broker/LegacyLVQ.cpp:97
#10 0x00002ba760ccf860 in qpid::broker::Queue::purge (this=0x2aaaac94c730, 
    purge_request=<value optimized out>, dest=..., filter=<value optimized out>)
    at ../../git-src/qpid/cpp/src/qpid/broker/Queue.cpp:629
#11 0x00002ba760d1f726 in qpid::broker::SessionAdapter::QueueHandlerImpl::purge (
    this=<value optimized out>, queue="TEST.RECEIVER.LV.0")
    at ../../git-src/qpid/cpp/src/qpid/broker/SessionAdapter.cpp:342
#12 0x00002ba7611b5e12 in ~Invoker (this=0x2ba760d1f640, body=<value optimized out>)
    at ./qpid/framing/ServerInvoker.h:218
#13 qpid::framing::AMQP_ServerOperations::Invoker::visit (this=0x2ba760d1f640, body=<value optimized out>)
    at qpid/framing/ServerInvoker.cpp:344
#14 0x0000000042f5dc30 in ?? ()
#15 0x0000000000000000 in ?? ()

If you inspect frame 8 you will that we are in the middle of iterating
through messages (in line 153)

148	
149	void MessageMap::removeIf(Predicate p)
150	{
151	    for (Ordering::iterator i = messages.begin(); i != messages.end(); i++) {
152	        if (p(i->second)) {
153	            erase(i);
154	        }
155	    }
156	}
157	

However the messages collection has 0 elements:
(gdb) p *this
$2 = {<qpid::broker::Messages> = {_vptr.Messages = 0x2ba761007f10}, key = "qpid.LVQ_key", 
  index = std::map with 0 elements, messages = std::map with 0 elements}

The only way I can think of to make this consistent is that after this thread started iterating some other thread went in and removed all the elements from index and messages.

This is not present in all of the core dumps but certainly is in many of them - I suspect it's a big clue.

Comment 15 Andrew Stitcher 2012-08-28 15:03:20 UTC
Some further results on the suggestive stacktrace:

Out of 7 core dumps 6 include this stacktrace.

Comment 18 Gordon Sim 2012-09-04 18:48:12 UTC
The MessageMap::removeIf() implementation for the 0.14 release branch does indeed look suspect (http://mrg1.lab.bos.redhat.com/cgit/qpid.git/tree/qpid/cpp/src/qpid/broker/MessageMap.cpp?h=0.14-mrg-preview-2).

The iterator is incremented even when an element is erased. Any purging of an LVQ would seem to have potential to cause memory corruption. (Fixing the loop to post-increment the iterator as it is passed to erase would avoid this problem at least; that is already the case on 0.18 and trunk no longer has that method at all).

Comment 19 Gordon Sim 2012-09-04 20:35:50 UTC
Simple reproducer:

  for k in `seq 1 100`; do ./src/tests/qpid-send -m 1 -a "lvq/key-$k; {create: always, node:{x-declare:{arguments:{qpid.last_value_queue_key:qpid.subject}}}}"; done

then purge the queue, e.g. via qpid-tool

Simple fix:

diff -up ./src/qpid/broker/MessageMap.cpp.orig ./src/qpid/broker/MessageMap.cpp
--- ./src/qpid/broker/MessageMap.cpp.orig	2012-09-04 16:28:16.000000000 -0400
+++ ./src/qpid/broker/MessageMap.cpp	2012-09-04 16:29:39.000000000 -0400
@@ -148,9 +148,11 @@ void MessageMap::foreach(Functor f)
 
 void MessageMap::removeIf(Predicate p)
 {
-    for (Ordering::iterator i = messages.begin(); i != messages.end(); i++) {
+    for (Ordering::iterator i = messages.begin(); i != messages.end();) {
         if (p(i->second)) {
-            erase(i);
+            erase(i++);
+        } else {
+            ++i;
         }
     }
 }

Note: This was tested and fixed against a pristine qpid 0.14 release as my attempts to clone mrg repo did not succeed this evening. However I believe it is entirely applicable to the 0.14 preview branch.

Comment 20 Gordon Sim 2012-09-04 20:44:19 UTC
Created attachment 609777 [details]
Suggested fix

Attaching patch for simpler application.

Comment 24 Frantisek Reznicek 2012-09-12 10:43:52 UTC
Retesting is ongoing, final results by 2012-09-14.

Comment 25 Frantisek Reznicek 2012-09-13 07:39:51 UTC
The issue has been fixed, retested on multiple machines RHEL 5.8 / 6.3 i[36]86 / x86_64 (focus on RHEL5.8x) on packages:

el5
  python-qpid-0.14-11.el5.noarch
  python-qpid-qmf-0.14-14.el5.x86_64
  python-saslwrapper-0.10-4.el5.x86_64
  qpid-cpp-client-0.14-22.el5.x86_64
  qpid-cpp-client-devel-0.14-22.el5.x86_64
  qpid-cpp-client-devel-docs-0.14-22.el5.x86_64
  qpid-cpp-client-rdma-0.14-22.el5.x86_64
  qpid-cpp-client-ssl-0.14-22.el5.x86_64
  qpid-cpp-server-0.14-22.el5.x86_64
  qpid-cpp-server-cluster-0.14-22.el5.x86_64
  qpid-cpp-server-devel-0.14-22.el5.x86_64
  qpid-cpp-server-rdma-0.14-22.el5.x86_64
  qpid-cpp-server-ssl-0.14-22.el5.x86_64
  qpid-cpp-server-store-0.14-22.el5.x86_64
  qpid-cpp-server-xml-0.14-22.el5.x86_64
  qpid-java-client-0.18-2.el5.noarch
  qpid-java-common-0.18-2.el5.noarch
  qpid-java-example-0.18-2.el5.noarch
  qpid-jca-0.18-2.el5.noarch
  qpid-jca-xarecovery-0.18-2.el5.noarch
  qpid-jca-zip-0.18-2.el5.noarch
  qpid-qmf-0.14-14.el5.x86_64
  qpid-qmf-devel-0.14-14.el5.x86_64
  qpid-tests-0.14-1.el5.noarch
  qpid-tools-0.14-6.el5.noarch
  rh-qpid-cpp-tests-0.14-22.el5.x86_64
  ruby-qpid-qmf-0.14-14.el5.x86_64
  ruby-saslwrapper-0.10-4.el5.x86_64
  saslwrapper-0.10-4.el5.x86_64
  saslwrapper-devel-0.10-4.el5.x86_64
  sesame-1.0-4.el5.x86_64
el6
  python-qpid-0.14-11.el6_3.noarch
  python-qpid-qmf-0.14-14.el6_3.x86_64
  python-saslwrapper-0.14-1.el6.x86_64
  qpid-cpp-client-0.14-22.el6_3.x86_64
  qpid-cpp-client-devel-0.14-22.el6_3.x86_64
  qpid-cpp-client-devel-docs-0.14-22.el6_3.noarch
  qpid-cpp-client-rdma-0.14-22.el6_3.x86_64
  qpid-cpp-client-ssl-0.14-22.el6_3.x86_64
  qpid-cpp-server-0.14-22.el6_3.x86_64
  qpid-cpp-server-cluster-0.14-22.el6_3.x86_64
  qpid-cpp-server-devel-0.14-22.el6_3.x86_64
  qpid-cpp-server-rdma-0.14-22.el6_3.x86_64
  qpid-cpp-server-ssl-0.14-22.el6_3.x86_64
  qpid-cpp-server-store-0.14-22.el6_3.x86_64
  qpid-cpp-server-xml-0.14-22.el6_3.x86_64
  qpid-java-client-0.18-2.el6.noarch
  qpid-java-common-0.18-2.el6.noarch
  qpid-java-example-0.18-2.el6.noarch
  qpid-jca-0.18-2.el6.noarch
  qpid-jca-xarecovery-0.18-2.el6.noarch
  qpid-jca-zip-0.18-2.el6.noarch
  qpid-qmf-0.14-14.el6_3.x86_64
  qpid-qmf-devel-0.14-14.el6_3.x86_64
  qpid-tests-0.14-1.el6_2.noarch
  qpid-tools-0.14-6.el6_3.noarch
  rh-qpid-cpp-tests-0.14-22.el6_3.x86_64
  ruby-qpid-qmf-0.14-14.el6_3.x86_64
  ruby-saslwrapper-0.14-1.el6.x86_64
  saslwrapper-0.14-1.el6.x86_64
  saslwrapper-devel-0.14-1.el6.x86_64
  sesame-1.0-6.el6.x86_64

-> VERIFIED