Bug 1333437 - segfault when rerouting and deleting a queue in parallel
Summary: segfault when rerouting and deleting a queue in parallel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.2
Hardware: All
OS: Linux
high
high
Target Milestone: 3.2.2
: ---
Assignee: Alan Conway
QA Contact: Messaging QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-05-05 13:36 UTC by Pavel Moravec
Modified: 2016-10-11 07:36 UTC (History)
5 users (show)

Fixed In Version: qpid-cpp-0.34-11
Doc Type: Bug Fix
Doc Text:
Cause: Missing locks around queue management object Consequence: Occasional broker crash with concurrent queue delete and management operations. Fix: Added appropriate locks. Result: No crash.
Clone Of:
Environment:
Last Closed: 2016-10-11 07:36:36 UTC
Target Upstream Version:


Attachments (Terms of Use)
create_queue.cpp (2.95 KB, text/x-csrc)
2016-05-05 13:39 UTC, Pavel Moravec
no flags Details
delete_queue.cpp (2.90 KB, text/x-csrc)
2016-05-05 13:40 UTC, Pavel Moravec
no flags Details
reroute_queue.cpp (2.73 KB, text/x-csrc)
2016-05-05 13:41 UTC, Pavel Moravec
no flags Details
Faster reproducer (4.70 KB, text/x-csrc)
2016-05-11 21:44 UTC, Alan Conway
no flags Details
Fixed bug in fast reproducer (4.74 KB, text/x-csrc)
2016-05-12 16:55 UTC, Alan Conway
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:2049 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 3.2.2 Bug Fix Release 2016-10-11 11:36:01 UTC

Description Pavel Moravec 2016-05-05 13:36:18 UTC
Description of problem:
When broker is deleting a queue while performing a QMF queue method (i.e. reroute, but I guess any can be used), there is a race condition the actions interfere each other causing segfault. One possible backtrace:



Version-Release number of selected component (if applicable):
qpid-cpp-server 0.34-10


How reproducible:
100% in 30 minutes


Steps to Reproduce:
run script:

for i in $(seq 1 100); do
	while true; do
		/root/Cpp_MRG/QMF_examples/create_queue q_${i} > /dev/null
		/root/Cpp_MRG/QMF_examples/reroute_queue q_${i} 0 amq.fanout > /dev/null &
		/root/Cpp_MRG/QMF_examples/delete_queue q_${i} > /dev/null &
		wait
		sleep 0.1
	done &
done
wait

programs [create|reroute|delete]_queue to be attached.


Actual results:
segfault within several minutes


Expected results:
no segfault


Additional info:
one possible backtrace:

(gdb) bt
#0  0x000000326110a1c4 in qmf::org::apache::qpid::broker::Queue::doMethod (this=0x7ff23ba39530, methodName=<value optimized out>, inMap=std::map with 3 elements = {...}, outMap=
    std::map with 0 elements, userId="anonymous") at /usr/src/debug/qpid-cpp-0.34/src/qmf/org/apache/qpid/broker/Queue.cpp:1161
#1  0x0000003261313d06 in qpid::management::ManagementAgent::handleMethodRequest (this=0x119f9f0, body=<value optimized out>, rte="", rtk=
    "6e080197-b362-4bf7-840f-9802e9f745dd#QueueRerouterResponseQueue", cid="", userId="anonymous", viaLocal=true)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/management/ManagementAgent.cpp:1447
#2  0x0000003261320fd5 in qpid::management::ManagementAgent::dispatchAgentCommand (this=0x119f9f0, msg=..., viaLocal=true)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/management/ManagementAgent.cpp:2347
#3  0x00000032613218b8 in qpid::management::ManagementAgent::dispatchCommand (this=0x119f9f0, deliverable=<value optimized out>, routingKey="broker", topic=false, qmfVersion=2)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/management/ManagementAgent.cpp:1255
#4  0x0000003261334179 in qpid::broker::ManagementDirectExchange::route (this=0x11a06c0, msg=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/management/ManagementDirectExchange.cpp:48
#5  0x00000032612cdcf4 in qpid::broker::SemanticState::route (this=0x7ff260832428, msg=..., strategy=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SemanticState.cpp:506
#6  0x00000032612eafda in qpid::broker::SessionState::handleContent (this=0x7ff260832260, frame=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionState.cpp:233
#7  0x00000032612eb5a1 in qpid::broker::SessionState::handleIn (this=0x7ff260832260, frame=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionState.cpp:293
#8  0x0000003260790af1 in qpid::amqp_0_10::SessionHandler::handleIn (this=0x7ff1b30e30a0, f=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/SessionHandler.cpp:93
#9  0x000000326125bc9b in operator() (this=0x7ff271c410d0, frame=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/framing/Handler.h:39
#10 qpid::broker::ConnectionHandler::handle (this=0x7ff271c410d0, frame=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/ConnectionHandler.cpp:93
#11 0x0000003261256a58 in qpid::broker::amqp_0_10::Connection::received (this=0x7ff271c40ef0, frame=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:198
#12 0x00000032611e4563 in qpid::amqp_0_10::Connection::decode (this=0x7ff271c3fcd0, buffer=<value optimized out>, size=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/Connection.cpp:59
#13 0x00000032607b9ce0 in qpid::sys::AsynchIOHandler::readbuff (this=0x7ff1d160a790, buff=0x7ff271c3f370) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:138
#14 0x0000003260737ea9 in operator() (this=0x7ff271c40110, h=...) at /usr/include/boost/function/function_template.hpp:1013
#15 qpid::sys::posix::AsynchIO::readable (this=0x7ff271c40110, h=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:453
#16 0x00000032607be5f3 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#17 0x00000032607bcf52 in qpid::sys::DispatchHandle::processEvent (this=0x7ff271c40118, type=qpid::sys::Poller::READ_WRITABLE)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:286
#18 0x000000326075de5d in process (this=0x118f9a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#19 qpid::sys::Poller::run (this=0x118f9a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#20 0x000000326075213a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#21 0x0000003af8807a51 in start_thread () from /lib64/libpthread.so.0
#22 0x0000003af84e896d in clone () from /lib64/libc.so.6
(gdb)

(here broker deleted Queue object but coreObject = mgmtObject of the Queue was attempted to be accessed)


another segfault:

(gdb) bt
#0  0x0000003f6f033d64 in abort () from /lib64/libc.so.6
#1  0x0000003f710bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#2  0x0000003f710bcbd6 in ?? () from /usr/lib64/libstdc++.so.6
#3  0x0000003f710bcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#4  0x0000003f710bd55f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#5  0x00007f3c16971d99 in qpid::broker::Queue::remove (this=0x164a8a0, maxCount=0, p=..., f=..., type=qpid::broker::CONSUMER, triggerAutoDelete=false, maxTests=0)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/Queue.cpp:750
#6  0x00007f3c169731f1 in qpid::broker::Queue::purge (this=0x164a8a0, qty=0, dest=..., filter=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/Queue.cpp:796
#7  0x00007f3c16973f1e in qpid::broker::Queue::ManagementMethod (this=0x164a8a0, methodId=<value optimized out>, args=..., etext=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/Queue.cpp:1471
#8  0x00007f3c168491f3 in qmf::org::apache::qpid::broker::Queue::doMethod (this=0x18f3140, methodName=<value optimized out>, inMap=std::map with 3 elements = {...}, outMap=
    std::map with 0 elements, userId="anonymous@QPID") at /usr/src/debug/qpid-cpp-0.34/src/qmf/org/apache/qpid/broker/Queue.cpp:1163
#9  0x00007f3c16a52d06 in qpid::management::ManagementAgent::handleMethodRequest (this=0x14fcb90, body=<value optimized out>, rte="", rtk=
    "7d78a4cf-f2fa-4bcf-858e-5662372ce32a#reply-queue", cid="", userId="anonymous@QPID", viaLocal=true) at /usr/src/debug/qpid-cpp-0.34/src/qpid/management/ManagementAgent.cpp:1447
#10 0x00007f3c16a5ffd5 in qpid::management::ManagementAgent::dispatchAgentCommand (this=0x14fcb90, msg=..., viaLocal=true)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/management/ManagementAgent.cpp:2347
#11 0x00007f3c16a608b8 in qpid::management::ManagementAgent::dispatchCommand (this=0x14fcb90, deliverable=<value optimized out>, routingKey="broker", topic=false, qmfVersion=2)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/management/ManagementAgent.cpp:1255
#12 0x00007f3c16a73179 in qpid::broker::ManagementDirectExchange::route (this=0x1511340, msg=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/management/ManagementDirectExchange.cpp:48
#13 0x00007f3c16a0ccf4 in qpid::broker::SemanticState::route (this=0x7f3bfc4d8288, msg=..., strategy=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SemanticState.cpp:506
#14 0x00007f3c16a29fda in qpid::broker::SessionState::handleContent (this=0x7f3bfc4d80c0, frame=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionState.cpp:233
#15 0x00007f3c16a2a5a1 in qpid::broker::SessionState::handleIn (this=0x7f3bfc4d80c0, frame=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionState.cpp:293
#16 0x00007f3c16446af1 in qpid::amqp_0_10::SessionHandler::handleIn (this=0x7f3bfc302e00, f=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/SessionHandler.cpp:93
#17 0x00007f3c1699ac9b in operator() (this=0x7f3bfc27b0f0, frame=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/framing/Handler.h:39
#18 qpid::broker::ConnectionHandler::handle (this=0x7f3bfc27b0f0, frame=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/ConnectionHandler.cpp:93
#19 0x00007f3c16995a58 in qpid::broker::amqp_0_10::Connection::received (this=0x7f3bfc27af10, frame=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:198
#20 0x00007f3c16923563 in qpid::amqp_0_10::Connection::decode (this=0x7f3bfc4f69e0, buffer=<value optimized out>, size=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/Connection.cpp:59
#21 0x00007f3c1646fce0 in qpid::sys::AsynchIOHandler::readbuff (this=0x7f3bfc531660, buff=0x7f3bfc79e7a0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:138
#22 0x00007f3c163edea9 in operator() (this=0x7f3bfc079690, h=...) at /usr/include/boost/function/function_template.hpp:1013
#23 qpid::sys::posix::AsynchIO::readable (this=0x7f3bfc079690, h=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:453
#24 0x00007f3c164745f3 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#25 0x00007f3c16473286 in qpid::sys::DispatchHandle::processEvent (this=0x7f3bfc079698, type=qpid::sys::Poller::READABLE) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:280
#26 0x00007f3c16413e5d in process (this=0x14fb7c0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#27 qpid::sys::Poller::run (this=0x14fb7c0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#28 0x00007f3c1640813a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#29 0x0000003f6f407a51 in start_thread () from /lib64/libpthread.so.0
#30 0x0000003f6f0e896d in clone () from /lib64/libc.so.6
(gdb)

(here, from the other side, Queue class being deleted while accessed from QMF method execution)

Comment 1 Pavel Moravec 2016-05-05 13:39:25 UTC
Created attachment 1154210 [details]
create_queue.cpp

Comment 2 Pavel Moravec 2016-05-05 13:40:34 UTC
Created attachment 1154211 [details]
delete_queue.cpp

Comment 3 Pavel Moravec 2016-05-05 13:41:52 UTC
Created attachment 1154212 [details]
reroute_queue.cpp

Comment 4 Alan Conway 2016-05-11 19:34:11 UTC
The following commits are possible fixes but unconfirmed as yet. On branch 

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/log/?h=0.34-mrg-meteorcomm-aconway


664350d Bug 1333437 - Conditional compile mismatch in broker and common libs.
f7f133f Bug 1333767 - Use shared_ptr for Link::destroy instead of weak_ptr
c322c49 Bug 1333767 - Memory management error in Link/Bridge

commit 664350ddc7c074dd96a5ffd021b1fdf1046c1485
Author: Alan Conway <aconway@redhat.com>
Commit: Alan Conway <aconway@redhat.com>

    Bug 1333437 - Conditional compile mismatch in broker and common libs.
    
    Removed _IN_QPID_BROKER compile definition:
    
    - inconsistently set for libqpidcommon (compiled .cpp files) and libqpidbroker (uses .h) files
      - The broker has a binary incompatible notion of what is in the library.
    
    - It sort-of works by accident:
       - shared_ptr<T> only contains a T*, the mismatch is effectively doing reinterpret_cast<T*>
       - plain T* works for op->, but doesn't guarantee no concurrent deletes.
       - we create and destroy shared_ptr in libraries with _IN_QPID_BROKER set so
         we get cleanup, and no cores if we're lucky but there is limited protection from races.
    
    - only used by management:
      - I think exposing non-shared ptr GetObject was a feature that never materialized,
      - if not we need a separate function or class for non-shared-ptr users.

commit f7f133f6651a3938654e7e8d14c511c16015f8b5
Author: Alan Conway <aconway@redhat.com>
Commit: Alan Conway <aconway@redhat.com>

    Bug 1333767 - Use shared_ptr for Link::destroy instead of weak_ptr
    
    The Bridge and Link::ioCallback functions can safely be dropped if it is too
    late but Link::destroy needs to be called, so use a shared_ptr for that.

commit c322c49f97f489b6a32d69790bb7d34dcfb6639e
Author: Alan Conway <aconway@redhat.com>
Commit: Alan Conway <aconway@redhat.com>

    Bug 1333767 - Memory management error in Link/Bridge
    
    Fixed a memory bug consistent with causing the stack in
    
        https://bugzilla.redhat.com/show_bug.cgi?id=1333767
    
    and several other customer and internally-generated stack traces.
    The fault is hard to reproduce so the fix is not definitely confirmed.
    
    qpid::broker Link and Bridge use Connection::requestIOProcessing() to register
    callbacks in the connection thread. They were binding a plain "this" pointer to
    the callback, but the classes are managed by boost::shared_ptr so if all the
    shared_ptr were released, the callback could happen on a dangling pointer.
    
    This fix uses boost::weak_ptr in the callbacks, so if all shared_ptr instances
    are released, we don't use the dead pointer.
[aconway@wallace rh-qpid (0.34-mrg-meteorcomm-aconway=)]$ git log -3 --pretty=short
commit 664350ddc7c074dd96a5ffd021b1fdf1046c1485
Author: Alan Conway <aconway@redhat.com>

    Bug 1333437 - Conditional compile mismatch in broker and common libs.

commit f7f133f6651a3938654e7e8d14c511c16015f8b5
Author: Alan Conway <aconway@redhat.com>

    Bug 1333767 - Use shared_ptr for Link::destroy instead of weak_ptr

commit c322c49f97f489b6a32d69790bb7d34dcfb6639e
Author: Alan Conway <aconway@redhat.com>

    Bug 1333767 - Memory management error in Link/Bridge
[aconway@wallace rh-qpid (0.34-mrg-meteorcomm-aconway=)]$ git log -3 --pretty=medium
commit 664350ddc7c074dd96a5ffd021b1fdf1046c1485
Author: Alan Conway <aconway@redhat.com>
Date:   Wed May 11 10:18:32 2016 -0400

    Bug 1333437 - Conditional compile mismatch in broker and common libs.
    
    Removed _IN_QPID_BROKER compile definition:
    
    - inconsistently set for libqpidcommon (compiled .cpp files) and libqpidbroker (uses .h) files
      - The broker has a binary incompatible notion of what is in the library.
    
    - It sort-of works by accident:
       - shared_ptr<T> only contains a T*, the mismatch is effectively doing reinterpret_cast<T*>
       - plain T* works for op->, but doesn't guarantee no concurrent deletes.
       - we create and destroy shared_ptr in libraries with _IN_QPID_BROKER set so
         we get cleanup, and no cores if we're lucky but there is limited protection from races.
    
    - only used by management:
      - I think exposing non-shared ptr GetObject was a feature that never materialized,
      - if not we need a separate function or class for non-shared-ptr users.

commit f7f133f6651a3938654e7e8d14c511c16015f8b5
Author: Alan Conway <aconway@redhat.com>
Date:   Wed May 11 10:30:31 2016 -0400

    Bug 1333767 - Use shared_ptr for Link::destroy instead of weak_ptr
    
    The Bridge and Link::ioCallback functions can safely be dropped if it is too
    late but Link::destroy needs to be called, so use a shared_ptr for that.

commit c322c49f97f489b6a32d69790bb7d34dcfb6639e
Author: Alan Conway <aconway@redhat.com>
Date:   Tue May 10 17:33:49 2016 -0400

    Bug 1333767 - Memory management error in Link/Bridge
    
    Fixed a memory bug consistent with causing the stack in
    
        https://bugzilla.redhat.com/show_bug.cgi?id=1333767
    
    and several other customer and internally-generated stack traces.
    The fault is hard to reproduce so the fix is not definitely confirmed.
    
    qpid::broker Link and Bridge use Connection::requestIOProcessing() to register
    callbacks in the connection thread. They were binding a plain "this" pointer to
    the callback, but the classes are managed by boost::shared_ptr so if all the
    shared_ptr were released, the callback could happen on a dangling pointer.
    
    This fix uses boost::weak_ptr in the callbacks, so if all shared_ptr instances
    are released, we don't use the dead pointer.

Comment 5 Alan Conway 2016-05-11 21:44:05 UTC
Created attachment 1156315 [details]
Faster reproducer

Fast reproducer, generates similar cores to the ones reported above in < 1 minute.

Compile with:

g++ -std=c++11 -g  -L/usr/local/lib64 -lqpidmessaging -lqpidtypes -lpthread stress.cpp -o stress`

Start a broker and run.

Runs one thread in a create/delete loop and 4 threads doing management methods concurrently. Threads don't lose time reconnecting each time around so they keep the broker in the dangerous code more continuously.

This confirms that the previous fix is not complete but I think I know what remains to be done (unsafe back-pointer ManagementObject::coreObject)

Comment 6 Alan Conway 2016-05-11 22:37:19 UTC
Added a fix to 
http://git.app.eng.bos.redhat.com/git/rh-qpid.git/log/?h=0.34-mrg-meteorcomm-aconway

With this fix the stress.cpp reproducer is not crashing, and the output shows 
"resource-deleted" exceptions where previously we would have crashed. Need further confirmation but I think this may be fixed.

commit 0cca6c5fcc5944c8af26ef1a843d1db7bd0e96da
Author: Alan Conway <aconway@redhat.com>
Date:   Wed May 11 18:24:07 2016 -0400

    Bug 1333767 - Locking around Manageable* coreObject in ManagedObject.
    
    ManagedObject has a back-pointer to the Manageable object it belongs to.
    The Manageable calls ManagedObject::resourceDestroy() when it is deleted,
    but there were no locks so this was thread-unsafe and caused dangling pointers.
    
    Added a locked wrapper class around coreObject so destroy is atomic with respect
    to calls on coreObject.

Comment 7 Alan Conway 2016-05-12 16:55:12 UTC
Created attachment 1156734 [details]
Fixed bug in fast reproducer

The reproducer was not accepting messages from the response queue so eventually it timed out with "queue to long" messages on the broker. Fixed it so it should run indefinitely.

Comment 11 errata-xmlrpc 2016-10-11 07:36:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2049.html


Note You need to log in before you can comment on or make changes to this bug.