Bug 1333767

Summary: Occasional segfault when creating & deleting many queues just before primary failover
Product: Red Hat Enterprise MRG
Reporter: Pavel Moravec <pmoravec>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED ERRATA
QA Contact: Messaging QE <messaging-qe-bugs>
Severity: high
Priority: high
Version: 3.2
CC: aconway, jross, mcressma, pmoravec, tkratky, zkraus
Target Milestone: 3.2.2
Hardware: All
OS: Linux
Fixed In Version: qpid-cpp-0.34-16
Doc Type: Bug Fix
Doc Text:
Cause: Interrupted recovery on a backup broker left invalid data behind, causing memory errors. Consequence: The backup crashed during failover/recovery. Fix: Cleanup after an interrupted recovery was corrected, and memory safety was improved around several other potential problems observed while developing the fix. Result: No crash.
Last Closed: 2016-10-11 07:36:41 UTC
Type: Bug
Attachments: drain_many_consumers.cpp

Description Pavel Moravec 2016-05-06 11:39:02 UTC
Description of problem:
When creating and deleting many queues in parallel in an active-passive HA cluster while the primary broker is shutting down / relocating, an occasional segfault can occur in qpid::broker::Bridge::create.


Version-Release number of selected component (if applicable):
qpid-cpp-server 0.34-10


How reproducible:
seldom (roughly once per hour)


Steps to Reproduce:
0. compile the attached C++ program to emulate rapid queue creation and deletion (an AMQP session creates many consumers on queues declared with "{create:always, delete:always}" and unsubscribes from all of them at the end)

1. start 3 brokers in an HA cluster with --link-maintenance-interval=10 (script below); this option merely helps trigger the segfault faster

2. run in one terminal:
./failover_primary_delete_queues.sh

3. run in another terminal:
ulimit -c 0 # drain_many_consumers will coredump, ignore it

while true; do for port in 5672 5673; do /root/Cpp_MRG/drain_many_consumers -c 2000 -t 2 -b localhost:${port} qqq_qqq & done; wait; sleep 1; done

4. wait until some backup broker (most probably the one on 5674) dumps core - it is worth setting ulimit -c unlimited before the test


Actual results:
4. after some time, a broker segfaults

Expected results:
4. no segfault


Additional info:
Backtrace (the top of the stack may differ, but it is always in qpid::broker::Bridge::create):

#0  0x0000003e0c889a93 in memcpy () from /lib64/libc.so.6
#1  0x0000003e0f09d1e6 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_clone(std::allocator<char> const&, unsigned long) () from /usr/lib64/libstdc++.so.6
#2  0x0000003e0f09d28c in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib64/libstdc++.so.6
#3  0x0000003e12588405 in qpid::SessionId::SessionId (this=0x7f208dfed160, u=<value optimized out>, n=
) at /usr/src/debug/qpid-cpp-0.34/src/qpid/SessionId.cpp:27
#4  0x0000003e0f6ee84e in qpid::broker::SessionHandler::attachAs (this=0x7f2074d03ac0, name=
) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionHandler.cpp:140
#5  0x0000003e0f6505af in qpid::broker::Bridge::create (this=0x3ba75c0, c=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/Bridge.cpp:150
#6  0x0000003e0f68ae11 in qpid::broker::Link::ioThreadProcessing (this=0x23422c0)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/Link.cpp:482
#7  0x0000003e0f656105 in operator() (this=0x7f20742a8c50) at /usr/include/boost/function/function_template.hpp:1013
#8  qpid::broker::amqp_0_10::Connection::doIoCallbacks (this=0x7f20742a8c50)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:388
#9  0x0000003e0f656aac in qpid::broker::amqp_0_10::Connection::received (this=0x7f20742a8c50, frame=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:204
#10 0x0000003e0f5e4563 in qpid::amqp_0_10::Connection::decode (this=0x7f20747e5130, buffer=<value optimized out>, 
    size=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/amqp_0_10/Connection.cpp:59
#11 0x0000003e125b9ce0 in qpid::sys::AsynchIOHandler::readbuff (this=0x7f207434e8e0, buff=0x7f20751e8100)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:138
#12 0x0000003e12537ea9 in operator() (this=0x7f20748c4410, h=...) at /usr/include/boost/function/function_template.hpp:1013
#13 qpid::sys::posix::AsynchIO::readable (this=0x7f20748c4410, h=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:453
#14 0x0000003e125be5f3 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, 
    a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#15 0x0000003e125bd286 in qpid::sys::DispatchHandle::processEvent (this=0x7f20748c4418, type=qpid::sys::Poller::READABLE)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:280
#16 0x0000003e1255de5d in process (this=0x2327d60) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#17 qpid::sys::Poller::run (this=0x2327d60) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#18 0x0000003e1255213a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#19 0x0000003e0d007aa1 in start_thread () from /lib64/libpthread.so.0
#20 0x0000003e0c8e893d in clone () from /lib64/libc.so.6

Comment 1 Pavel Moravec 2016-05-06 11:44:26 UTC
Two reproducer scripts:

(/root/qpidd.acl allows everything; it is needed just to permit anonymous links)

# cat start_ha_brokers.sh 
start_broker() {
	port=$1
	shift
	rm -rf _${port}
	mkdir _${port}
	nohup qpidd --port=$port --store-dir=_${port} --log-to-file=qpidd.$port.log --data-dir=_${port} --auth=no --log-to-stderr=no --ha-cluster=yes --ha-brokers-url="localhost:5672,localhost:5673,localhost:5674" --ha-replicate=all --acl-file=/root/qpidd.acl --link-maintenance-interval=10 "$@" > /dev/null 2>&1 &
	sleep 1
}


killall qpidd 2> /dev/null
rm -f qpidd.*.log
start_broker 5672
sleep 1
qpid-ha promote -b localhost:5672 --cluster-manager
sleep 1
start_broker 5673
sleep 1
start_broker 5674
#



and



# cat ./failover_primary_delete_queues.sh
primary=${1:-5672}
queues=2000

while true; do
	date
	/root/Cpp_MRG/drain_many_consumers -c $queues -t 2 -b localhost:${primary} qqq
	kill $(ps aux | grep qpidd | grep "port=${primary}" | awk '{ print $2}')
	port=$primary
	primary=$(($((5672+5673))-$((primary))))  # toggle primary between 5672 and 5673
	qpid-ha promote -b localhost:${primary} --cluster-manager
	sleep 2
	qpidd --port=${port} --store-dir=_${port} --log-to-file=qpidd.${port}.log --data-dir=_${port} --auth=no --log-to-stderr=no --ha-cluster=yes --ha-brokers-url=localhost:5672,localhost:5673,localhost:5674 --ha-replicate=all --acl-file=/root/qpidd.acl & #--log-enable=debug+:Bridge --log-enable=notice+ --log-hires-timestamp=yes &
	sleep 5
done
#


The brokers on ports 5672 and 5673 alternate as primary, failing over just after drain_many_consumers has created 2000 queues and deleted them in a batch (with the same drain_many_consumers running in parallel in another terminal).

Comment 2 Pavel Moravec 2016-05-06 11:45:19 UTC
Created attachment 1154591 [details]
drain_many_consumers.cpp

Comment 3 Pavel Moravec 2016-05-11 10:02:27 UTC
Btw. while trying to reproduce the segfault more reliably (so far I have hit it just once), I often get a segfault like:

(gdb) bt
#0  0x0000003e0c832625 in raise () from /lib64/libc.so.6
#1  0x0000003e0c833e05 in abort () from /lib64/libc.so.6
#2  0x0000003e0f0bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x0000003e0f0bcbd6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003e0f0bcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003e0f0bd55f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x0000003e0f6eed16 in qpid::broker::SessionHandler::handleDetach (this=0x7fb4d4b5de20)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionHandler.cpp:112
#7  0x0000003e0f6557fc in qpid::broker::amqp_0_10::Connection::closed (this=0x7fb4d4c4c190)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:374
#8  0x0000003e125b9345 in qpid::sys::AsynchIOHandler::~AsynchIOHandler (this=0x7fb4d472cd00, __in_chrg=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:70
#9  0x0000003e125b9439 in qpid::sys::AsynchIOHandler::~AsynchIOHandler (this=0x7fb4d472cd00, __in_chrg=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:74
#10 0x0000003e125ba940 in qpid::sys::AsynchIOHandler::closedSocket (this=0x7fb4d472cd00, s=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:196
#11 0x0000003e12536239 in operator() (this=0x7fb4d47bd6d0, h=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:1013
#12 qpid::sys::posix::AsynchIO::close (this=0x7fb4d47bd6d0, h=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:616
#13 0x0000003e12537c9e in qpid::sys::posix::AsynchIO::writeable (this=0x7fb4d47bd6d0, h=...)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:575
#14 0x0000003e125be5f3 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, 
    a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#15 0x0000003e125bcf5e in qpid::sys::DispatchHandle::processEvent (this=0x7fb4d47bd6d8, type=qpid::sys::Poller::WRITABLE)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:287
#16 0x0000003e1255de5d in process (this=0x1fb6bf0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#17 qpid::sys::Poller::run (this=0x1fb6bf0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#18 0x0000003e1255213a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>)
    at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#19 0x0000003e0d007aa1 in start_thread () from /lib64/libpthread.so.0
#20 0x0000003e0c8e893d in clone () from /lib64/libc.so.6

Comment 4 Alan Conway 2016-05-11 19:32:50 UTC
(In reply to Pavel Moravec from comment #3)
> Btw. trying to more reliably reproduce the segfault (so far hit it just
> once), I often get segfault like:
> 
> (gdb) bt
> #0  0x0000003e0c832625 in raise () from /lib64/libc.so.6
> #1  0x0000003e0c833e05 in abort () from /lib64/libc.so.6
> #2  0x0000003e0f0bea7d in __gnu_cxx::__verbose_terminate_handler() () from
> /usr/lib64/libstdc++.so.6
> #3  0x0000003e0f0bcbd6 in ?? () from /usr/lib64/libstdc++.so.6
> #4  0x0000003e0f0bcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
> #5  0x0000003e0f0bd55f in __cxa_pure_virtual () from
> /usr/lib64/libstdc++.so.6
> #6  0x0000003e0f6eed16 in qpid::broker::SessionHandler::handleDetach
> (this=0x7fb4d4b5de20)
>     at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionHandler.cpp:112

handleDetach calls errorListener->detach(); and errorListener is a shared_ptr to a Bridge, so this crash is consistent with these fixes:

f7f133f Bug 1333767 - Use shared_ptr for Link::destroy instead of weak_ptr
c322c49 Bug 1333767 - Memory management error in Link/Bridge

I also have a fix for the initial stack in this bug report:

664350d Bug 1333437 - Conditional compile mismatch in broker and common libs.

None of the fixes are confirmed yet but they correspond well to the stack traces.
All are on branch:

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/log/?h=0.34-mrg-meteorcomm-aconway

Commit comments:

commit 664350ddc7c074dd96a5ffd021b1fdf1046c1485
Author: Alan Conway <aconway>
Date:   Wed May 11 10:18:32 2016 -0400

    Bug 1333437 - Conditional compile mismatch in broker and common libs.
    
    Removed _IN_QPID_BROKER compile definition:
    
    - inconsistently set for libqpidcommon (which compiles the .cpp files) and libqpidbroker (which uses the .h files)
      - The broker has a binary incompatible notion of what is in the library.
    
    - It sort-of works by accident:
       - shared_ptr<T> only contains a T*, the mismatch is effectively doing reinterpret_cast<T*>
       - plain T* works for op->, but doesn't guarantee no concurrent deletes.
       - we create and destroy shared_ptr in libraries with _IN_QPID_BROKER set so
         we get cleanup, and no cores if we're lucky but there is limited protection from races.
    
    - only used by management:
      - I think exposing non-shared ptr GetObject was a feature that never materialized,
      - if not we need a separate function or class for non-shared-ptr users.
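
(Illustrative sketch, not actual qpid code: a header whose member type changes under a compile definition gives two libraries binary-incompatible views of the same class. All names below are invented.)

    #include <boost/shared_ptr.hpp>

    struct Managed { int value; };

    struct Holder {
    #ifdef IN_BROKER_BUILD
        boost::shared_ptr<Managed> ptr;  // broker's view: pointer + refcount
    #else
        Managed* ptr;                    // common's view: plain pointer only
    #endif
    };
    // This "sort-of works by accident" because boost::shared_ptr's first
    // word is the raw T*, so ptr-> finds the right object - but there is
    // no reference count to protect against a concurrent delete.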

commit f7f133f6651a3938654e7e8d14c511c16015f8b5
Author: Alan Conway <aconway>
Date:   Wed May 11 10:30:31 2016 -0400

    Bug 1333767 - Use shared_ptr for Link::destroy instead of weak_ptr
    
    The Bridge and Link::ioCallback functions can safely be dropped if it is too
    late but Link::destroy needs to be called, so use a shared_ptr for that.

commit c322c49f97f489b6a32d69790bb7d34dcfb6639e
Author: Alan Conway <aconway>
Date:   Tue May 10 17:33:49 2016 -0400

    Bug 1333767 - Memory management error in Link/Bridge
    
    Fixed a memory bug consistent with causing the stack in
    
        https://bugzilla.redhat.com/show_bug.cgi?id=1333767
    
    and several other customer and internally-generated stack traces.
    The fault is hard to reproduce so the fix is not definitely confirmed.
    
    qpid::broker Link and Bridge use Connection::requestIOProcessing() to register
    callbacks in the connection thread. They were binding a plain "this" pointer to
    the callback, but the classes are managed by boost::shared_ptr so if all the
    shared_ptr were released, the callback could happen on a dangling pointer.
    
    This fix uses boost::weak_ptr in the callbacks, so if all shared_ptr instances
    are released, we don't use the dead pointer.
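
(Rough sketch of the weak_ptr pattern this commit describes - simplified, with a hypothetical class shape rather than the real Bridge code:)

    #include <boost/bind.hpp>
    #include <boost/enable_shared_from_this.hpp>
    #include <boost/function.hpp>
    #include <boost/shared_ptr.hpp>
    #include <boost/weak_ptr.hpp>
    #include <iostream>

    class Bridge : public boost::enable_shared_from_this<Bridge> {
      public:
        // Unsafe: if every shared_ptr is released before the IO thread
        // runs the callback, "this" dangles.
        boost::function<void()> unsafeCallback() {
            return boost::bind(&Bridge::create, this);
        }
        // Safe: the callback holds only a weak_ptr; lock() yields an
        // empty shared_ptr once the Bridge is gone.
        boost::function<void()> safeCallback() {
            boost::weak_ptr<Bridge> weak(shared_from_this());
            return boost::bind(&Bridge::createIfAlive, weak);
        }
        void create() { std::cout << "Bridge::create\n"; }
      private:
        static void createIfAlive(boost::weak_ptr<Bridge> weak) {
            if (boost::shared_ptr<Bridge> self = weak.lock()) self->create();
            // else: too late, the Bridge is destroyed; drop the callback.
        }
    };

    int main() {
        boost::function<void()> cb;
        {
            boost::shared_ptr<Bridge> b(new Bridge);
            cb = b->safeCallback();
        }      // last shared_ptr released here
        cb();  // no-op instead of a use-after-free
        return 0;
    }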

Comment 9 Alan Conway 2016-06-09 20:20:28 UTC
Another speculative fix, worth testing, although we still don't have a good reproducer. It is a missing piece of the previous fix.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-meteorcomm-aconway-2&id=bc179055287cd10fb58de03fecad4df647f075db

Bug 1333767 - Added resourceDestroy to Queue::~Queue.
Call resourceDestroy() in Queue::~Queue. The previous fix assumed that
Queue::destroyed() was always called before Queue::~Queue; stack traces
suggest that may not be true.

Renamed ManagedObject::CoreObject as ManagedObject::ManageablePtr to clarify its
role - it is simply an atomic wrapper to protect access to a Manageable*, not a
separate object.
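
(Rough sketch of what an atomic pointer wrapper of this kind might look like - invented and simplified, not the actual ManageablePtr:)

    #include <boost/thread/mutex.hpp>

    class Manageable { public: virtual ~Manageable() {} };

    // Guards a plain Manageable* so resourceDestroy() can null it out
    // while another thread is reading it.
    class ManageablePtr {
        Manageable* ptr;
        mutable boost::mutex lock;
      public:
        explicit ManageablePtr(Manageable* m) : ptr(m) {}
        Manageable* get() const {
            boost::mutex::scoped_lock l(lock);
            return ptr;            // 0 once resourceDestroy() has run
        }
        void resourceDestroy() {
            boost::mutex::scoped_lock l(lock);
            ptr = 0;               // later get() callers see null
        }
    };
    // Note the limitation raised in comments 13 and 15 below: this
    // protects the read of the pointer, not the lifetime of the pointee,
    // so it still depends on resourceDestroy() being called in time.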

Comment 10 Alan Conway 2016-06-10 20:21:21 UTC
Found a definite race if a queue is destroyed in 2 threads concurrently - Queue::destroyed() called without a lock. This fits our theory so far. 

This can happen to an auto-delete queue if 2 of the following are concurrent:

- management command from user (qpid-config etc.)
- last consumer unsubscribes (on primary)
- auto-delete timer runs out
- backup promoted (auto-delete is suppressed on backups, enabled on promote)

Working on fix, this might help with a reproducer.
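
(Minimal sketch of the race shape and the obvious guard - hypothetical, not the real Queue code: two threads racing to destroy the same auto-delete queue must not both run the teardown.)

    #include <iostream>
    #include <boost/thread/mutex.hpp>

    class Queue {
        bool destroyedFlag;
        boost::mutex lock;
      public:
        Queue() : destroyedFlag(false) {}
        void destroyed() {
            {
                boost::mutex::scoped_lock l(lock);
                if (destroyedFlag) return;  // second caller backs off
                destroyedFlag = true;       // first caller claims teardown
            }
            std::cout << "tearing down queue resources exactly once\n";
        }
    };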

Comment 11 Pavel Moravec 2016-06-11 11:01:48 UTC
(In reply to Alan Conway from comment #10)
> Found a definite race if a queue is destroyed in 2 threads concurrently -
> Queue::destroyed() called without a lock. This fits our theory so far. 
> 
> This can happen to an auto-delete queue if 2 of the following are concurrent:
> 
> - management command from user (qpid-config etc.)
> - last consumer unsubscribes (on primary)
> - auto-delete timer runs out
> - backup promoted (auto-delete is suppressed on backups, enabled on promote)
> 
> Working on fix, this might help with a reproducer.

So something like:

1) time to time, relocate primary qpidd

2) run script like:

queues=15
for i in $(seq 1 $queues); do
	while true; do
		rnd=$(uuidgen)
		qpid-receive -a "autoDelQueue_${i}_${rnd}; {create:always, node:{ x-declare:{auto-delete:True, alternate-exchange:amq.direct, arguments:{'qpid.auto_delete_timeout':1}}}}" -b "localhost:5672,localhost:5673,localhost:5674"
		sleep 0.9$(($(($((RANDOM))%5))+5))  # sleep 0.95-0.99s, just under the 1s auto-delete timeout
		for port in 5672 5673 5674; do
			~/Cpp_MRG/QMF_examples/reroute_queue autoDelQueue_${i}_${rnd} 0 amq.fanout localhost:${port} 2> /dev/null &
		done
		wait
	done &
done

which (in parallel) creates auto-delete queues and, at the moment each queue is about to be auto-deleted, calls the reroute QMF method on it.


I tried that, no luck :(

Comment 12 Pavel Moravec 2016-06-11 11:11:16 UTC
(In reply to Alan Conway from comment #10)
> Found a definite race if a queue is destroyed in 2 threads concurrently -
> Queue::destroyed() called without a lock. This fits our theory so far. 
> 
> This can happen to an auto-delete queue if 2 of the following are concurrent:
> 
> - management command from user (qpid-config etc.)
> - last consumer unsubscribes (on primary)
> - auto-delete timer runs out
> - backup promoted (auto-delete is suppressed on backups, enabled on promote)

Just to clarify:
- backup broker gets the QMF command
- auto-del timer on the queue expires
- now the backup broker is promoted to primary
- Queue::destroyed() and qmf::org::apache::qpid::broker::Queue::doMethod() are executed concurrently on the just-promoted broker - and this causes the segfault?

That should be covered by the reproducer, right?

Comment 13 Pavel Moravec 2016-06-12 07:23:47 UTC
Just a silly idea:

the problem arises when the qpid::broker::Queue destructor is called while qpid::management::ManagementObject::CoreObject still points to the object. What about changing:

    class CoreObject {
        Manageable* ptr;

to:

    class CoreObject {
        boost::shared_ptr<Manageable> ptr;

to prevent the premature destructor call (triggered by boost, I guess), since boost would then still keep a reference to the Queue object?

Comment 14 Pavel Moravec 2016-06-12 13:01:45 UTC
Some further observations:

1) the queue double-deleted in the customer's coredump is auto-delete but _without_ an auto-delete timeout:

gdb -c _core.qpidd-01627293-1 /usr/sbin/qpidd
..
(gdb) frame 2
#2  0x000000354a10ac19 in qmf::org::apache::qpid::broker::Queue::doMethod (this=0x7f5b4c090230, methodName=<value optimized out>, 
    inMap=<value optimized out>, outMap=std::map with 0 elements, userId="anonymous")
    at /usr/src/debug/qpid-cpp-0.34/src/qmf/org/apache/qpid/broker/Queue.cpp:1161
1161	        bool allow = coreObject.AuthorizeMethod(METHOD_REROUTE, ioArgs, userId);
(gdb) p arguments
$1 = std::map with 6 elements = {
  ["qpid.ha-uuid"] = {impl = 0x7f5b4f8c7140},
  ["qpid.max_count"] = {impl = 0x7f5ba16ca990},
  ["qpid.max_size"] = {impl = 0x7f5b50eb3d30},
  ["qpid.policy_type"] = {impl = 0x7f5b4eaaa300},
  ["qpid.replicate"] = {impl = 0x7f5ba0b24fe0},
  ["x-qpid-priorities"] = {impl = 0x7f5ba0bfdf40}
}
(gdb) 


Note there is no "qpid.auto_delete_timeout" argument there.


2) On the backup broker, the qpid::broker::Queue object - since it has no auto-delete timeout - must be deleted via BrokerReplicator::doEventQueueDelete, where Queue::destroyed and CoreObject::destroy are called first, setting ptr = 0 and preventing the segfault.

So the qpid::broker::Queue object was deleted after the broker was promoted to primary, imho.

Comment 15 Alan Conway 2016-06-13 13:40:56 UTC
(In reply to Pavel Moravec from comment #13)
> Just a silly idea:
> 
> the problem comes when qpid::broker::Queue destructor is called while
> qpid::management::ManagementObject::CoreObject points to it. What about
...
> 
>     class CoreObject {
>         boost::shared_ptr<Manageable> ptr;

Not silly, that was my first idea. The problem is Manageable is a base class for all types of manageable objects, and they don't all use shared_ptr. We would have to change them all to be consistent, which I wanted to avoid.

CoreObject acts a bit like weak_ptr (the companion of shared_ptr) but uses the existing resourceDestroy() callback. However, after you crashed my first patch I found race conditions around calling resourceDestroy() as well. I've fixed those, but now that you remind me, I think it may be worth doing the work to use weak_ptr in case there are more issues. It is simpler and safer because it is self-contained and does not depend on our code correctly calling functions like resourceDestroy().
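
(Sketch of why weak_ptr is self-contained - assumed names, not the actual management code: the management side holds only a weak_ptr, so expiry is detected automatically instead of relying on a resourceDestroy()-style hook being called.)

    #include <boost/shared_ptr.hpp>
    #include <boost/weak_ptr.hpp>
    #include <iostream>

    struct Manageable {
        virtual ~Manageable() {}
        virtual void doMethod() = 0;
    };
    struct Queue : Manageable {
        void doMethod() { std::cout << "reroute\n"; }
    };

    void managementCall(boost::weak_ptr<Manageable> weak) {
        if (boost::shared_ptr<Manageable> m = weak.lock())
            m->doMethod();  // object is pinned for the duration of the call
        else
            std::cout << "object already destroyed, command ignored\n";
    }

    int main() {
        boost::shared_ptr<Manageable> q(new Queue);
        boost::weak_ptr<Manageable> w(q);
        managementCall(w);  // prints "reroute"
        q.reset();          // broker drops the queue; no hook needs calling
        managementCall(w);  // safely ignored
        return 0;
    }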

Comment 16 Alan Conway 2016-06-13 13:50:50 UTC
(In reply to Pavel Moravec from comment #14)
> So the qpid::broker::Queue object was deleted after the broker was promoted
> to primary, imho.

My current pet theory is this: an auto-delete queue (does not require timeout) is created, subscribed and un-subscribed. The primary sends un-sub and delete to the backup and then fails. The backup is processing the primary delete when another thread receives the "promote" from the cluster manager. 

"promote" re-activates auto-delete (which is disabled in backup mode) and therefore destroys the queue (which is auto-del and has been unsubscribed) at the same time as the replication thread is processing the last command from the primary, which also destroys the queue.

How does that sound from your evidence? Unfortunately I've been running something all weekend and have not reproduced it ;(

Comment 17 Alan Conway 2016-06-15 14:38:28 UTC
Pushed a fix for the hypothetical queue deletion race. Still no convincing reproducer.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-meteorcomm&id=ddf24b88fb8ad3f2200182ab31ecd70d185abc41

Bug 1333767 - Fix race conditions during Queue destruction.
Move calls to Queue::destroyed() into QueueRegistry::destroy().
Use the QueueRegistry lock to guarantee exactly one call.

Removed some dead code.

Call resourceDestroy() in Queue::~Queue if it was not called already. This is
probably redundant given the destroyed() fix, but can't hurt.

Renamed ManagedObject::CoreObject as ManagedObject::ManageablePtr to clarify its
role as an atomic pointer wrapper, not a separate object.
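
(Condensed sketch of the fix's shape - invented signatures, not the actual QueueRegistry: removing the queue from the registry map under the registry lock means exactly one caller goes on to run destroyed().)

    #include <map>
    #include <string>
    #include <boost/shared_ptr.hpp>
    #include <boost/thread/mutex.hpp>

    class Queue {
      public:
        void destroyed() { /* release bindings, management object, ... */ }
    };

    class QueueRegistry {
        std::map<std::string, boost::shared_ptr<Queue> > queues;
        boost::mutex lock;
      public:
        void destroy(const std::string& name) {
            boost::shared_ptr<Queue> q;
            {
                boost::mutex::scoped_lock l(lock);
                std::map<std::string, boost::shared_ptr<Queue> >::iterator i =
                    queues.find(name);
                if (i == queues.end()) return;  // already destroyed elsewhere
                q = i->second;
                queues.erase(i);                // only one caller reaches here
            }
            q->destroyed();  // runs exactly once, outside the registry lock
        }
    };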

Comment 18 Alan Conway 2016-06-16 21:10:39 UTC
Fixes to date for this BZ are under review upstream, tracked by https://issues.apache.org/jira/browse/QPID-7306

Comment 19 Alan Conway 2016-07-12 20:09:24 UTC
Created attachment 1178988 [details]
reproducer for core dump during failover.

start the cluster (ha-start.sh), then run make_queues.sh and bounce.sh in parallel. The bounce script kills the primary after a random delay, restarts it after a random delay, then waits for full recovery before killing the primary again.

Got 2 cores this morning about 9am, none since. The cores are about 1.5 GB, which suggests there were a lot of queues at the time.

Comment 20 Alan Conway 2016-07-15 00:42:17 UTC
Still no reliable reproducer, but this fix is consistent with all the cores we have, both from the customer and those I reproduced.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-aconway-bz1333767-failover&id=f1f1c3cf84caab8ec046564fdce0658e7612da6c

Bug 1333767: Backup crash during re-connect in failover
Fixed bug in backup recovery consistent with stack traces from a customer and
similar traces reproduced as explained in
    https://bugzilla.redhat.com/show_bug.cgi?id=1333767#c19.

The bug was in BrokerReplicator: an interrupted failover left the UpdateTracker
in place with a queue list. The next failover created a new UpdateTracker and
deleted the old one, which triggered deletion of queues. This caused re-entrant
deletion of bridges while Link::ioThreadProcessing is iterating over the bridge
list, which could cause the iteration to go past the end of the list and access
invalid memory.

The exact circumstances are hard to reproduce; they require multiple steps:

1. Backup receives queue creation events and calls to Link::add() with new
bridges, but is disconnected from primary before the Link::ioThreadProcessing()
event. This puts the new bridges on the Link::created list *before* the
BrokerReplicator bridge.

2. Backup re-connects and creates an UpdateTracker with a list of all queues.
Before the update is complete, the backup is disconnected leaving the
UpdateTracker in place.

3. Backup re-connects again and deletes the old UpdateTracker, which deletes all
the queues and their bridges in the Link::created vector. This means the iterator
in Link::ioThreadProcessing now points after the end of the vector.

Step 1 is needed because normally the BrokerReplicator is the first bridge in
the list, and deleting queues after it leaves the iterator inside bounds. There
needs to be at least one deleted queue before the BrokerReplicator in the list.

There are a number of reasons a failover could be interrupted:

- two primaries are killed in quick succession.
- primary disconnects an expected "ready" backup at the ha-backup-timeout.
- a backup misses a heartbeat during failover.
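
(Distilled sketch of the failure mode - hypothetical and far simpler than the real Link code: erasing from a vector during iteration invalidates the iterators, so the loop can walk past end() into invalid memory. The actual fix cleans up the interrupted-recovery state in BrokerReplicator; the snapshot variant below merely shows one way such iteration can be made robust.)

    #include <iostream>
    #include <vector>

    struct Bridge { int id; };

    std::vector<Bridge> created;  // bridges awaiting ioThreadProcessing

    void process(Bridge& b) { std::cout << "bridge " << b.id << "\n"; }
    // In the broker, processing a bridge could re-entrantly delete queues
    // and erase entries from "created".

    void ioThreadProcessingUnsafe() {
        for (std::vector<Bridge>::iterator i = created.begin();
             i != created.end(); ++i)
            process(*i);  // a re-entrant erase() invalidates i and end()
    }

    void ioThreadProcessingSafe() {
        std::vector<Bridge> work;
        work.swap(created);  // snapshot: re-entrant changes hit "created",
        for (std::size_t i = 0; i != work.size(); ++i)  // not the list we
            process(work[i]);                           // are iterating
    }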

Comment 34 errata-xmlrpc 2016-10-11 07:36:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2049.html