Bug 1333767
Summary: Occasional segfault when creating & deleting many queues just before primary failover

Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: 3.2
Target Milestone: 3.2.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Pavel Moravec <pmoravec>
Assignee: Alan Conway <aconway>
QA Contact: Messaging QE <messaging-qe-bugs>
CC: aconway, jross, mcressma, pmoravec, tkratky, zkraus
Fixed In Version: qpid-cpp-0.34-16
Doc Type: Bug Fix
Doc Text:
  Cause: Interrupted recovery on a backup left invalid data behind, causing memory errors.
  Consequence: The backup crashed during failover/recovery.
  Fix: Cleanup after interrupted recovery was corrected, and the memory safety of several other potential problems observed during the fix was improved.
  Result: No crash.
Type: Bug
Last Closed: 2016-10-11 07:36:41 UTC
Description
Pavel Moravec
2016-05-06 11:39:02 UTC
Two reproducer scripts (/root/qpidd.acl allows everything; it is needed just to allow anonymous links):

# cat start_ha_brokers.sh
start_broker() {
    port=$1
    shift
    rm -rf _${port}
    mkdir _${port}
    nohup qpidd --port=$port --store-dir=_${port} --log-to-file=qpidd.$port.log --data-dir=_${port} --auth=no --log-to-stderr=no --ha-cluster=yes --ha-brokers-url="localhost:5672,localhost:5673,localhost:5674" --ha-replicate=all --acl-file=/root/qpidd.acl --link-maintenance-interval=10 "$@" > /dev/null 2>&1 &
    sleep 1
}
killall qpidd 2> /dev/null
rm -f qpidd.*.log
start_broker 5672
sleep 1
qpid-ha promote -b localhost:5672 --cluster-manager
sleep 1
start_broker 5673
sleep 1
start_broker 5674

# and

# cat ./failover_primary_delete_queues.sh
primary=${1:-5672}
queues=2000
while true; do
    date
    /root/Cpp_MRG/drain_many_consumers -c $queues -t 2 -b localhost:${primary} qqq
    kill $(ps aux | grep qpidd | grep "port=${primary}" | awk '{ print $2}')
    port=$primary
    primary=$(($((5672+5673))-$((primary))))
    qpid-ha promote -b localhost:${primary} --cluster-manager
    sleep 2
    qpidd --port=${port} --store-dir=_${port} --log-to-file=qpidd.${port}.log --data-dir=_${port} --auth=no --log-to-stderr=no --ha-cluster=yes --ha-brokers-url=localhost:5672,localhost:5673,localhost:5674 --ha-replicate=all --acl-file=/root/qpidd.acl &
    #--log-enable=debug+:Bridge --log-enable=notice+ --log-hires-timestamp=yes &
    sleep 5
done

Brokers on ports 5672 and 5673 alternate as primary; the failover happens just after drain_many_consumers has created 2000 queues and deleted them in a batch (with the same drain_many_consumers running in parallel in another terminal).

Created attachment 1154591 [details]
drain_many_consumers.cpp
Btw. trying to reproduce the segfault more reliably (so far I hit it just once), I often get a segfault like:

(gdb) bt
#0  0x0000003e0c832625 in raise () from /lib64/libc.so.6
#1  0x0000003e0c833e05 in abort () from /lib64/libc.so.6
#2  0x0000003e0f0bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x0000003e0f0bcbd6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003e0f0bcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003e0f0bd55f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x0000003e0f6eed16 in qpid::broker::SessionHandler::handleDetach (this=0x7fb4d4b5de20) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionHandler.cpp:112
#7  0x0000003e0f6557fc in qpid::broker::amqp_0_10::Connection::closed (this=0x7fb4d4c4c190) at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/amqp_0_10/Connection.cpp:374
#8  0x0000003e125b9345 in qpid::sys::AsynchIOHandler::~AsynchIOHandler (this=0x7fb4d472cd00, __in_chrg=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:70
#9  0x0000003e125b9439 in qpid::sys::AsynchIOHandler::~AsynchIOHandler (this=0x7fb4d472cd00, __in_chrg=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:74
#10 0x0000003e125ba940 in qpid::sys::AsynchIOHandler::closedSocket (this=0x7fb4d472cd00, s=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/AsynchIOHandler.cpp:196
#11 0x0000003e12536239 in operator() (this=0x7fb4d47bd6d0, h=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#12 qpid::sys::posix::AsynchIO::close (this=0x7fb4d47bd6d0, h=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:616
#13 0x0000003e12537c9e in qpid::sys::posix::AsynchIO::writeable (this=0x7fb4d47bd6d0, h=...) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/AsynchIO.cpp:575
#14 0x0000003e125be5f3 in boost::function1<void, qpid::sys::DispatchHandle&>::operator() (this=<value optimized out>, a0=<value optimized out>) at /usr/include/boost/function/function_template.hpp:1013
#15 0x0000003e125bcf5e in qpid::sys::DispatchHandle::processEvent (this=0x7fb4d47bd6d8, type=qpid::sys::Poller::WRITABLE) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/DispatchHandle.cpp:287
#16 0x0000003e1255de5d in process (this=0x1fb6bf0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/Poller.h:131
#17 qpid::sys::Poller::run (this=0x1fb6bf0) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/epoll/EpollPoller.cpp:522
#18 0x0000003e1255213a in qpid::sys::(anonymous namespace)::runRunnable (p=<value optimized out>) at /usr/src/debug/qpid-cpp-0.34/src/qpid/sys/posix/Thread.cpp:35
#19 0x0000003e0d007aa1 in start_thread () from /lib64/libpthread.so.0
#20 0x0000003e0c8e893d in clone () from /lib64/libc.so.6

(In reply to Pavel Moravec from comment #3)
> Btw. trying to reproduce the segfault more reliably (so far I hit it just
> once), I often get a segfault like:
>
> (gdb) bt
> ...
> #6 0x0000003e0f6eed16 in qpid::broker::SessionHandler::handleDetach
> (this=0x7fb4d4b5de20)
> at /usr/src/debug/qpid-cpp-0.34/src/qpid/broker/SessionHandler.cpp:112

handleDetach calls errorListener->detach(), and errorListener is a shared_ptr to a Bridge, so this crash is consistent with these fixes:

f7f133f Bug 1333767 - Use shared_ptr for Link::destroy instead of weak_ptr
c322c49 Bug 1333767 - Memory management error in Link/Bridge

I also have a fix for the initial stack in this bug report:

664350d Bug 1333437 - Conditional compile mismatch in broker and common libs.

None of the fixes are confirmed yet, but they correspond well to the stack traces. All are on branch:
http://git.app.eng.bos.redhat.com/git/rh-qpid.git/log/?h=0.34-mrg-meteorcomm-aconway

Commit comments:

commit 664350ddc7c074dd96a5ffd021b1fdf1046c1485
Author: Alan Conway <aconway>
Date: Wed May 11 10:18:32 2016 -0400

    Bug 1333437 - Conditional compile mismatch in broker and common libs.

    Removed the _IN_QPID_BROKER compile definition:
    - inconsistently set for libqpidcommon (compiled .cpp files) and libqpidbroker (uses .h files)
    - The broker has a binary-incompatible notion of what is in the library.
    - It sort-of works by accident:
      - shared_ptr<T> only contains a T*; the mismatch is effectively doing reinterpret_cast<T*>
      - a plain T* works for op->, but doesn't guarantee no concurrent deletes.
      - we create and destroy shared_ptr in libraries with _IN_QPID_BROKER set, so we get cleanup, and no cores if we're lucky, but there is limited protection from races.
    - only used by management:
      - I think exposing a non-shared-ptr GetObject was a feature that never materialized;
      - if not, we need a separate function or class for non-shared-ptr users.
commit f7f133f6651a3938654e7e8d14c511c16015f8b5
Author: Alan Conway <aconway>
Date: Wed May 11 10:30:31 2016 -0400

    Bug 1333767 - Use shared_ptr for Link::destroy instead of weak_ptr

    The Bridge and Link::ioCallback functions can safely be dropped if it is
    too late, but Link::destroy needs to be called, so use a shared_ptr for
    that.

commit c322c49f97f489b6a32d69790bb7d34dcfb6639e
Author: Alan Conway <aconway>
Date: Tue May 10 17:33:49 2016 -0400

    Bug 1333767 - Memory management error in Link/Bridge

    Fixed a memory bug consistent with causing the stack in
    https://bugzilla.redhat.com/show_bug.cgi?id=1333767 and several other
    customer and internally generated stack traces. The fault is hard to
    reproduce, so the fix is not definitely confirmed.

    qpid::broker Link and Bridge use Connection::requestIOProcessing() to
    register callbacks in the connection thread. They were binding a plain
    "this" pointer to the callback, but the classes are managed by
    boost::shared_ptr, so if all the shared_ptrs were released, the callback
    could happen on a dangling pointer. This fix uses boost::weak_ptr in the
    callbacks, so if all shared_ptr instances are released, we don't use the
    dead pointer.

Another speculative fix, worth testing, but we still don't have a good reproducer. It's a missing piece from the previous fix.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-meteorcomm-aconway-2&id=bc179055287cd10fb58de03fecad4df647f075db

Bug 1333767 - Added resourceDestroy to Queue::~Queue. (branch 0.34-mrg-meteorcomm-aconway-2)

    Call resourceDestroy() in Queue::~Queue. The previous fix assumed that
    Queue::destroyed() was always called before Queue::~Queue; stack traces
    suggest that may not be true.

    Renamed ManagedObject::CoreObject to ManagedObject::ManageablePtr to
    clarify its role - it is simply an atomic wrapper to protect access to a
    Manageable*, not a separate object.

Found a definite race if a queue is destroyed in 2 threads concurrently - Queue::destroyed() called without a lock.
This fits our theory so far. This can happen to an auto-delete queue if 2 of the following are concurrent:

- management command from user (qpid-config etc.)
- last consumer unsubscribes (on primary)
- auto-delete timer runs out
- backup promoted (auto-delete is suppressed on backups, enabled on promote)

Working on a fix; this might help with a reproducer.

(In reply to Alan Conway from comment #10)
> Found a definite race if a queue is destroyed in 2 threads concurrently -
> Queue::destroyed() called without a lock. This fits our theory so far.
> ...
> Working on fix, this might help with a reproducer.

So something like:

1) from time to time, relocate the primary qpidd
2) run a script like:

queues=15
for i in $(seq 1 $queues); do
    while true; do
        rnd=$(uuidgen)
        qpid-receive -a "autoDelQueue_${i}_${rnd}; {create:always, node:{ x-declare:{auto-delete:True, alternate-exchange:amq.direct, arguments:{'qpid.auto_delete_timeout':1}}}}" -b "localhost:5672,localhost:5673,localhost:5674"
        sleep 0.9$(($(($((RANDOM))%5))+5))
        for port in 5672 5673 5674; do
            ~/Cpp_MRG/QMF_examples/reroute_queue autoDelQueue_${i}_${rnd} 0 amq.fanout localhost:${port} 2> /dev/null &
        done
        wait
    done &
done

which (in parallel) creates an auto-delete queue and, at the moment it is being auto-deleted, calls the reroute QMF method on the queue. I tried that, no luck :(

(In reply to Alan Conway from comment #10)
> Found a definite race if a queue is destroyed in 2 threads concurrently -
> Queue::destroyed() called without a lock. This fits our theory so far.
>
> This can happen to an auto-delete queue if 2 of the following are concurrent:
>
> - management command from user (qpid-config etc.)
> - last consumer unsubscribes (on primary)
> - auto-delete timer runs out
> - backup promoted (auto-delete is suppressed on backups, enabled on promote)

Just to clarify:

- the backup broker gets the QMF command
- the auto-del timer on the queue expires
- now the backup broker is promoted to primary
- Queue::destroyed() and qmf::org::apache::qpid::broker::Queue::doMethod() execute concurrently on the just-promoted broker
- and this causes the segfault?

That should be covered by the reproducer, right?

Just a silly idea: the problem comes when the qpid::broker::Queue destructor is called while qpid::management::ManagementObject::CoreObject points to it. What about changing:

class CoreObject {
    Manageable* ptr;

to:

class CoreObject {
    boost::shared_ptr<Manageable> ptr;

to prevent the first destructor call (triggered by boost, I guess), since boost will still keep a reference to the Queue object?

Some further observations:

1) the queue double-deleted in the customer's coredump is auto-delete but _without_ an autoDel timeout:

gdb -c _core.qpidd-01627293-1 /usr/sbin/qpidd
..
(gdb) frame 2
#2  0x000000354a10ac19 in qmf::org::apache::qpid::broker::Queue::doMethod (this=0x7f5b4c090230, methodName=<value optimized out>, inMap=<value optimized out>, outMap=std::map with 0 elements, userId="anonymous") at /usr/src/debug/qpid-cpp-0.34/src/qmf/org/apache/qpid/broker/Queue.cpp:1161
1161        bool allow = coreObject.AuthorizeMethod(METHOD_REROUTE, ioArgs, userId);
(gdb) p arguments
$1 = std::map with 6 elements = {
  ["qpid.ha-uuid"] = {impl = 0x7f5b4f8c7140},
  ["qpid.max_count"] = {impl = 0x7f5ba16ca990},
  ["qpid.max_size"] = {impl = 0x7f5b50eb3d30},
  ["qpid.policy_type"] = {impl = 0x7f5b4eaaa300},
  ["qpid.replicate"] = {impl = 0x7f5ba0b24fe0},
  ["x-qpid-priorities"] = {impl = 0x7f5ba0bfdf40}
}
(gdb)

See no "qpid.auto_delete_timeout" argument there.
2) On the backup broker, the qpid::broker::Queue object - since it has no autoDel timeout - must be deleted from BrokerReplicator::doEventQueueDelete, where Queue::destroyed and CoreObject::destroy are called first - setting ptr = 0 and preventing the segfault. So the qpid::broker::Queue object was deleted after the broker was promoted to primary, imho.

(In reply to Pavel Moravec from comment #13)
> Just a silly idea:
>
> the problem comes when qpid::broker::Queue destructor is called while
> qpid::management::ManagementObject::CoreObject points to it. What about
> ...
> class CoreObject {
>     boost::shared_ptr<Manageable> ptr;

Not silly, that was my first idea. The problem is that Manageable is a base class for all types of manageable objects, and they don't all use shared_ptr. We would have to change them all to be consistent, which I wanted to avoid. CoreObject acts a bit like weak_ptr (a friend of shared_ptr) but uses the existing resourceDestroy() callback. However, after you crashed my first patch I found race conditions around calling resourceDestroy() as well. I've fixed those, but now that you remind me, I think it may be worth doing the work to use weak_ptr in case there are more issues. It is simpler and safer because it is self-contained and does not depend on our code correctly calling functions like resourceDestroy().

(In reply to Pavel Moravec from comment #14)
> So the qpid::broker::Queue object was deleted after the broker was promoted
> to primary, imho.

My current pet theory is this: an auto-delete queue (no timeout required) is created, subscribed and un-subscribed. The primary sends the un-sub and delete to the backup and then fails. The backup is processing the primary's delete when another thread receives the "promote" from the cluster manager. "promote" re-activates auto-delete (which is disabled in backup mode) and therefore destroys the queue (which is auto-del and has been unsubscribed) at the same time as the replication thread is processing the last command from the primary, which also destroys the queue. How does that sound against your evidence? Unfortunately I've been running something all weekend and have not reproduced ;(

Pushed a fix for the hypothetical queue deletion race. Still no convincing reproducer.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-meteorcomm&id=ddf24b88fb8ad3f2200182ab31ecd70d185abc41

Bug 1333767 - Fix race conditions during Queue destruction. (branch 0.34-mrg-meteorcomm)

    Move calls to Queue::destroyed() into QueueRegistry::destroy(). Use the
    QueueRegistry lock to guarantee exactly one call. Removed some dead code.

    Call resourceDestroy() in Queue::~Queue if it was not called already.
    This is probably redundant given the destroyed() fix, but can't hurt.

    Renamed ManagedObject::CoreObject to ManagedObject::ManageablePtr to
    clarify its role as an atomic pointer wrapper, not a separate object.

Fixes to date for this BZ are under review upstream, tracked by
https://issues.apache.org/jira/browse/QPID-7306

Created attachment 1178988 [details]
reproducer for core dump during failover.
start the cluster (ha-start.sh) then make_queues.sh and bounce.sh in parallel. The bounce script kills the primary after a random delay, restarts it after a random delay, then waits for full recovery before killing the primary again.
Got 2 cores this morning around 9am, none since. The cores are about 1.5 GB, which suggests there were a lot of queues at the time.
Still no reliable reproducer, but this fix is consistent with all the cores we have, from the customer and reproduced by me.

http://git.app.eng.bos.redhat.com/git/rh-qpid.git/commit/?h=0.34-mrg-aconway-bz1333767-failover&id=f1f1c3cf84caab8ec046564fdce0658e7612da6c

Bug 1333767: Backup crash during re-connect in failover (branch 0.34-mrg-aconway-bz1333767-failover)

    Fixed a bug in backup recovery consistent with stack traces from a
    customer and similar traces reproduced as explained in
    https://bugzilla.redhat.com/show_bug.cgi?id=1333767#c19.

    The bug was in BrokerReplicator: an interrupted failover left the
    UpdateTracker in place with a queue list. The next failover created a new
    UpdateTracker and deleted the old one, which triggered deletion of
    queues. This caused re-entrant deletion of bridges while
    Link::ioThreadProcessing is iterating over the bridge list, which could
    cause the iteration to go past the end of the list and access invalid
    memory.

    The exact circumstances are hard to reproduce; it requires multiple
    steps:

    1. Backup receives queue creation events and calls to Link::add() with
       new bridges, but is disconnected from the primary before the
       Link::ioThreadProcessing() event. This puts the new bridges on the
       Link::created list *before* the BrokerReplicator bridge.

    2. Backup re-connects and creates an UpdateTracker with a list of all
       queues. Before the update is complete, the backup is disconnected,
       leaving the UpdateTracker in place.

    3. Backup re-connects again and deletes the old UpdateTracker, which
       deletes all the queues and their bridges in the Link::created vector.
       This means the iterator in Link::ioThreadProcessing now points past
       the end of the vector.

    Step 1 is needed because normally the BrokerReplicator is the first
    bridge in the list, and deleting queues after it leaves the iterator in
    bounds. There needs to be at least one deleted queue before the
    BrokerReplicator in the list.
There are a number of reasons a failover could be interrupted:

- two primaries are killed in quick succession.
- the primary disconnects an expected "ready" backup at the ha-backup-timeout.
- a backup misses a heartbeat during failover.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2049.html