Bug 1564132 - invalid bridges can be detached in a stress test, resulting into infinite ping-pong of session.detach
Summary: invalid bridges can be detached in a stress test, resulting into infinite pin...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.5
Hardware: x86_64
OS: Linux
high
high
Target Milestone: 3.2.10
: ---
Assignee: messaging-bugs
QA Contact: Zdenek Kraus
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-05 12:37 UTC by Pavel Moravec
Modified: 2018-05-10 00:42 UTC (History)
5 users (show)

Fixed In Version: qpid-cpp-1.36.0-15
Doc Type: Bug Fix
Doc Text:
Cause: When routes between brokers were deleted, a software bug sometimes allowed other open routes to be closed by mistake. Consequence: This could cause unexpected behavior, including possibly crashing the broker. Fix: The code to track which routes to close has been fixed. Result: Now we do not get unexpected route closings and crashes. (everything works correctly)
Clone Of:
Environment:
Last Closed: 2018-05-07 14:57:15 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Apache JIRA QPID-7991 0 None None None 2018-04-09 14:12:09 UTC
Red Hat Product Errata RHBA-2018:1334 0 None None None 2018-05-07 14:57:18 UTC

Description Pavel Moravec 2018-04-05 12:37:55 UTC
Description of problem:
In a stress test with many bridges detaches, random valid bridges are also dropped and solely infinite session.detach ping-pong can happen. The cause is that:

https://github.com/apache/qpid-cpp/blob/master/src/qpid/broker/Link.cpp#L469-L471

uses std::remove_if that does _not_ guarantee content of the vector in [removed,end) range. Particular example - if following Example from

http://www.cplusplus.com/reference/algorithm/remove_if/

then:
 [begin,removed) contains 2 4 6 8
 [removed,end) contains 5 6 7 8 9

I.e. quite random items "marked to remove" - and Link.cpp then drops those bridges, dropping not-detached brdges and not dropping detached bridges.

In some race condition scenario (when both brokers do the same at the same time?), this triggers infinite ping-pong of session detach between the brokers.

we should use std::partition (and reverse condition) instead.

(kudos to the customer who found this)


Version-Release number of selected component (if applicable):
any, incl. upstream


How reproducible:
100% (but unknown ATM)


Steps to Reproduce:
1. ??? I guess creating many bridges and triggering many bursts of detachments of them.


Actual results:
- random (even stable) bridges are dropped
- rarely infinite ping-pong of session.detach


Expected results:
neither of above


Additional info:

Comment 1 Mike Cressman 2018-04-06 17:45:48 UTC
Proposed patch:


diff --git a/qpid/cpp/src/qpid/broker/Link.cpp b/qpid/cpp/src/qpid/broker/Link.cpp
index ed4c54e..2ad8f18 100644
--- a/qpid/cpp/src/qpid/broker/Link.cpp
+++ b/qpid/cpp/src/qpid/broker/Link.cpp
@@ -489,8 +489,9 @@ void Link::ioThreadProcessing()
 
     // check for bridge session errors and recover
     if (!active.empty()) {
-        Bridges::iterator removed = std::remove_if(
-            active.begin(), active.end(), boost::bind(&Bridge::isDetached, _1));
+        // Put the detached bridges at the back of the list
+        Bridges::iterator removed = std::partition(
+            active.begin(), active.end(), !boost::bind(&Bridge::isDetached, _1));
         for (Bridges::iterator i = removed; i != active.end(); ++i) {
             Bridge::shared_ptr  bridge = *i;
             bridge->closed();


This is based on the 0.18-mrg downstream branch.  The code is similar in newer branches, but not sure if the patch applies cleanly there.

Comment 3 Alan Conway 2018-04-06 19:03:33 UTC
This patch is good. It's equivalent to the code on the latest master branch.

(Actually this patch is slightly more efficient, master does 2 scans of the entire list, whereas this does one complete scan and one of just the detached bridges)

Comment 4 Mike Cressman 2018-04-06 22:03:10 UTC
This was fixed upstream previously via QPID-7991, as part of the 1.37.0 release.
I will do a hotfix build for MRG 2 (0.18-based), and we will use this BZ to backport it for the next MRG 3 release (1.36.0-based).

Comment 5 Pavel Moravec 2018-04-09 07:16:10 UTC
Reproducer for MRG 2 / qpid 0.18: and MRG 3 / qpid 1.36:

- start 2 brokers on localhost, listening on ports 5672 and 5673, with default link-maintenance-interval
 
- create more (autodelete) queues, let them have consumer via federation, and then - in a bulk - forcefully delete some queues. Also some _different_ queues will be removed as an outcome (what's wrong):


maxQueue=9
maxDelQueue=4

# create auto-del queues
for type in a b; do
    for i in $(seq 0 $maxQueue); do 
        qpid-send -a "${type}_queue_$i; {create:always, node:{ x-declare:{auto-delete:True}}}"
    done
done

# create federation routes from all queues on 5672 broker
for type in a b; do
    for i in $(seq 0 $maxQueue); do
        qpid-route queue add $(hostname):5673 $(hostname):5672 amq.direct ${type}_queue_${i} &
    done
done
while [ $(ps aux | grep -v grep | grep -c qpid-route) -gt 0 ]; do sleep 1; done

# check all queues exist
qpid-stat -q

# delete some a_* queues
for i in $(seq 0 $maxDelQueue); do qpid-receive -a "a_queue_$i; {delete:always}" & done
wait

# check just those queues were deleted
qpid-stat -q

# wait for link maintenance interval and check if link maintenance dropped some other sessions causing some other queues are deleted
sleep 3
qpid-stat -q


See that the latest qpid-stat differs from previous one - some queues are newly missing that we hadnt touched - meantime just link maintenance internal procedure happened.


reproducible on any 0.18 and 1.36 qpid version, not reproducible in upstream (QPID-7991 seems ot fix this though still using the problematic-in-this-context remove_if there).

Comment 6 Alan Conway 2018-04-09 13:55:10 UTC
(In reply to Pavel Moravec from comment #5)

> reproducible on any 0.18 and 1.36 qpid version, not reproducible in upstream
> (QPID-7991 seems ot fix this though still using the
> problematic-in-this-context remove_if there).

remove_if is not the problem in itself, but it was being used incorrectly.

Upstream was fixed to use remove_if correctly - it scans the entire list to process deleted bridges *before* doing remove_if to remove them. 

The fix in comment 1 using std::partition is also correct, std::partition keeps all the elements of the original list and just re-orders them, whereas std::remove_if makes no guarantees about what will be in the "removed" section of the list. In theory std::remove_if should be a bit faster, but the solution with std::partition doesn't have to scan the entire list twice, so it's hard to say which is more efficient. Either is a good fix.

Comment 13 Zdenek Kraus 2018-05-03 11:50:26 UTC
Issue were tested with:
qpid-cpp-1.36.0-15.el6
qpid-cpp-1.36.0-16.el7

Fix works as expected.

-> VERIFIED

Comment 16 errata-xmlrpc 2018-05-07 14:57:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1334


Note You need to log in before you can comment on or make changes to this bug.