Description of problem: In a stress test with many bridges detaches, random valid bridges are also dropped and solely infinite session.detach ping-pong can happen. The cause is that: https://github.com/apache/qpid-cpp/blob/master/src/qpid/broker/Link.cpp#L469-L471 uses std::remove_if that does _not_ guarantee content of the vector in [removed,end) range. Particular example - if following Example from http://www.cplusplus.com/reference/algorithm/remove_if/ then: [begin,removed) contains 2 4 6 8 [removed,end) contains 5 6 7 8 9 I.e. quite random items "marked to remove" - and Link.cpp then drops those bridges, dropping not-detached brdges and not dropping detached bridges. In some race condition scenario (when both brokers do the same at the same time?), this triggers infinite ping-pong of session detach between the brokers. we should use std::partition (and reverse condition) instead. (kudos to the customer who found this) Version-Release number of selected component (if applicable): any, incl. upstream How reproducible: 100% (but unknown ATM) Steps to Reproduce: 1. ??? I guess creating many bridges and triggering many bursts of detachments of them. Actual results: - random (even stable) bridges are dropped - rarely infinite ping-pong of session.detach Expected results: neither of above Additional info:
Proposed patch: diff --git a/qpid/cpp/src/qpid/broker/Link.cpp b/qpid/cpp/src/qpid/broker/Link.cpp index ed4c54e..2ad8f18 100644 --- a/qpid/cpp/src/qpid/broker/Link.cpp +++ b/qpid/cpp/src/qpid/broker/Link.cpp @@ -489,8 +489,9 @@ void Link::ioThreadProcessing() // check for bridge session errors and recover if (!active.empty()) { - Bridges::iterator removed = std::remove_if( - active.begin(), active.end(), boost::bind(&Bridge::isDetached, _1)); + // Put the detached bridges at the back of the list + Bridges::iterator removed = std::partition( + active.begin(), active.end(), !boost::bind(&Bridge::isDetached, _1)); for (Bridges::iterator i = removed; i != active.end(); ++i) { Bridge::shared_ptr bridge = *i; bridge->closed(); This is based on the 0.18-mrg downstream branch. The code is similar in newer branches, but not sure if the patch applies cleanly there.
This patch is good. It's equivalent to the code on the latest master branch. (Actually this patch is slightly more efficient, master does 2 scans of the entire list, whereas this does one complete scan and one of just the detached bridges)
This was fixed upstream previously via QPID-7991, as part of the 1.37.0 release. I will do a hotfix build for MRG 2 (0.18-based), and we will use this BZ to backport it for the next MRG 3 release (1.36.0-based).
Reproducer for MRG 2 / qpid 0.18: and MRG 3 / qpid 1.36: - start 2 brokers on localhost, listening on ports 5672 and 5673, with default link-maintenance-interval - create more (autodelete) queues, let them have consumer via federation, and then - in a bulk - forcefully delete some queues. Also some _different_ queues will be removed as an outcome (what's wrong): maxQueue=9 maxDelQueue=4 # create auto-del queues for type in a b; do for i in $(seq 0 $maxQueue); do qpid-send -a "${type}_queue_$i; {create:always, node:{ x-declare:{auto-delete:True}}}" done done # create federation routes from all queues on 5672 broker for type in a b; do for i in $(seq 0 $maxQueue); do qpid-route queue add $(hostname):5673 $(hostname):5672 amq.direct ${type}_queue_${i} & done done while [ $(ps aux | grep -v grep | grep -c qpid-route) -gt 0 ]; do sleep 1; done # check all queues exist qpid-stat -q # delete some a_* queues for i in $(seq 0 $maxDelQueue); do qpid-receive -a "a_queue_$i; {delete:always}" & done wait # check just those queues were deleted qpid-stat -q # wait for link maintenance interval and check if link maintenance dropped some other sessions causing some other queues are deleted sleep 3 qpid-stat -q See that the latest qpid-stat differs from previous one - some queues are newly missing that we hadnt touched - meantime just link maintenance internal procedure happened. reproducible on any 0.18 and 1.36 qpid version, not reproducible in upstream (QPID-7991 seems ot fix this though still using the problematic-in-this-context remove_if there).
(In reply to Pavel Moravec from comment #5) > reproducible on any 0.18 and 1.36 qpid version, not reproducible in upstream > (QPID-7991 seems ot fix this though still using the > problematic-in-this-context remove_if there). remove_if is not the problem in itself, but it was being used incorrectly. Upstream was fixed to use remove_if correctly - it scans the entire list to process deleted bridges *before* doing remove_if to remove them. The fix in comment 1 using std::partition is also correct, std::partition keeps all the elements of the original list and just re-orders them, whereas std::remove_if makes no guarantees about what will be in the "removed" section of the list. In theory std::remove_if should be a bit faster, but the solution with std::partition doesn't have to scan the entire list twice, so it's hard to say which is more efficient. Either is a good fix.
Issue were tested with: qpid-cpp-1.36.0-15.el6 qpid-cpp-1.36.0-16.el7 Fix works as expected. -> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1334