Bug 771169 - cluster de-sync due to QMF console connection
Summary: cluster de-sync due to QMF console connection
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.1
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: 2.2
Assignee: mick
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks: 785156 840059
 
Reported: 2012-01-02 09:05 UTC by Pavel Moravec
Modified: 2018-11-30 22:43 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 840059
Environment:
Last Closed: 2012-07-25 18:12:57 UTC
Target Upstream Version:


Attachments
Scripts for replication (526 bytes, application/x-gzip)
2012-01-02 09:05 UTC, Pavel Moravec
trace logs with some notes (881.03 KB, application/x-gzip)
2012-06-04 15:25 UTC, Pavel Moravec

Description Pavel Moravec 2012-01-02 09:05:04 UTC
Description of problem:
In a two-node cluster where qpid-tool (or any equivalent client subscribed to QMF management updates) runs indefinitely and the cluster generates lots of events (e.g. a queue reaching its capacity), one node eventually hits the critical error "confirmed < (5183+0) but only sent < (5182+0)".


Version-Release number of selected component (if applicable):
qpid version 0.12-6


How reproducible:
60% within one hour


Steps to Reproduce:
1. Have a 2-node cluster (with nodes named train1 and train2 in my case); start it with an empty store.
2. Provision some queues to make the QMF updates bigger - 100 queues, each bound to the amq.direct exchange:
for i in $(seq 100 200); do ./spout -b train1:5672 "test.${i}.queue; {create:always, node:{type:queue, x-bindings:[{ exchange:amq.direct, queue:test.${i}.queue, key:test.${i}.queue }]}}"; done
3. Start qpid-tool running forever on both brokers - to do it automatically, run the attached expect script:
nohup expect expect_script_for_infinite_qpid-tool &
4. Add ten ring queues: 
for i in $(seq 1 10); do qpid-config add queue TinyRingQueue${i} --max-queue-size=100 --max-queue-count=100 --limit-policy=ring -a train1:5672; done
5. Run the attached script to send and drain messages on each ring queue:
./send_drain_all_TinyQueues.sh
The script invokes the send_drain_TinyQueue.sh script so that, for each of the 10 queues, it periodically sends a random number of messages and drains 1 message. The key point is that each ring queue reaches its limit in almost every loop of the script (an illustrative sketch of this loop appears after step 6 below).
6. Wait until some node shuts down.
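
For reference, a minimal C++ sketch of the per-queue send/drain loop described in step 5. This is only an illustration of the behavior using the qpid::messaging API, not the attached shell script; the burst size, payload, and queue name are invented. Link against qpidmessaging.

#include <qpid/messaging/Connection.h>
#include <qpid/messaging/Session.h>
#include <qpid/messaging/Sender.h>
#include <qpid/messaging/Receiver.h>
#include <qpid/messaging/Message.h>
#include <qpid/messaging/Duration.h>
#include <cstdlib>
#include <string>

using namespace qpid::messaging;

// Illustrative equivalent of the send/drain loop for one ring queue: send a
// random burst of messages (so the 100-message ring limit is exceeded on
// almost every iteration), then drain a single message.
int main(int argc, char** argv) {
    std::string queue = argc > 1 ? argv[1] : "TinyRingQueue1";
    Connection connection("train1:5672");
    connection.open();
    Session session = connection.createSession();
    Sender sender = session.createSender(queue);
    Receiver receiver = session.createReceiver(queue);

    for (;;) {
        int burst = 50 + std::rand() % 100;        // random burst, often > ring size
        for (int i = 0; i < burst; ++i)
            sender.send(Message("filler"));
        Message m;
        if (receiver.fetch(m, Duration::SECOND))   // drain one message
            session.acknowledge();
    }
    // never returns; run until a broker shuts down with the de-sync error
}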


Actual results:
With 60% probability within 1 hour, a node will shut down with a log entry like:
2012-01-01 23:12:27 critical cluster(10.34.1.187:15940 READY/error) local error 2083360 did not occur on member 10.34.1.197:12769: invalid-argument: anonymous.pmoravec-mrg-train2.13188.1: confirmed < (11476+0) but only sent < (11475+0) (qpid/SessionState.cpp:154)


Expected results:
No such shutdown occurs.


Additional info:
(*) Sending messages to a full ring queue seems to be the key point here. When no policy limit is exceeded (no "Queue message count exceeded policy for" log entry with info+ logging enabled), the problem does not appear.

(*) From time to time, monitor qpid-tool memory consumption - it keeps growing because the tool memorizes old (already deleted) objects like connections or sessions. You can restart the qpid-tool / expect script from time to time to prevent it from hitting OOM.

(*) For QE: I can provide a script that repeatedly runs the test scenario reproducing this bug and monitors whether qpid died - just ask.

Comment 1 Pavel Moravec 2012-01-02 09:05:59 UTC
Created attachment 550215 [details]
Scripts for replication

Comment 2 Pavel Moravec 2012-01-02 14:40:38 UTC
An idea of the root cause in the source code:
- the bug sometimes appears as a race condition after a ResourceLimitExceededException is raised (on a ring queue, though it may not be specific to ring queues).
- the symptom is that the node which only replays the activity on the qpid-tool connection receives the ACK for a message before the message itself.
- could the cause be wrong synchronization when multicasting the event (the ResourceLimitExceededException), so that it arrives later than the multicast of the response from qpid-tool?

Comment 3 Pavel Moravec 2012-05-25 19:03:58 UTC
Probably caused by bz 786418 "QMF errors ignored by cluster, causing cluster de-sync".

When qpid-tool is removed from the scenario, no de-sync occurs.

Comment 5 Pavel Moravec 2012-06-04 10:42:10 UTC
Comparing the logs & sources, the point where the error occurs is weird:

2012-06-04 09:30:32 trace cluster(10.34.1.197:5638 READY) DLVR 1355681:  Frame[BEbe; channel=1; {SessionCompletedBody: commands={ [0,74522] }; }] data 10.34.1.187:14530-309 read-credit=1
2012-06-04 09:30:32 debug anonymous.pmoravec-mrg-train1.16085.1: sender marked completed: { [0,74522] }
2012-06-04 09:30:32 debug anonymous.pmoravec-mrg-train1.16085.1: sender confirmed point moved to (74523+0)
2012-06-04 09:30:32 trace cluster(10.34.1.197:5638 READY) DLVR 1355682:  Frame[BEbe; channel=1; {SessionCompletedBody: commands={ [0,74523] }; }] data 10.34.1.187:14530-309 read-credit=1
2012-06-04 09:30:32 debug anonymous.pmoravec-mrg-train1.16085.1: sender marked completed: { [0,74523] }
2012-06-04 09:30:32 debug Exception constructed: anonymous.pmoravec-mrg-train1.16085.1: confirmed < (74524+0) but only sent < (74523+0) (qpid/SessionState.cpp:154)

So the error occurs between "sender marked completed" and "sender confirmed point moved to". Sources have:

void SessionState::senderCompleted(const SequenceSet& commands) {
    if (commands.empty()) return;
    QPID_LOG(debug, getId() << ": sender marked completed: " << commands);
    sender.incomplete -= commands;
    // Completion implies confirmation but we don't handle out-of-order
    // confirmation, so confirm up to the end of the first contiguous range of commands.
    senderConfirmed(SessionPoint(commands.rangesBegin()->end()));
}

void SessionState::senderConfirmed(const SessionPoint& confirmed) {
    if (confirmed > sender.sendPoint)
        throw InvalidArgumentException(QPID_MSG(getId() << ": confirmed < " << confirmed << " but only sent < " << sender.sendPoint));
    QPID_LOG(debug, getId() << ": sender confirmed point moved to " << confirmed);
..


So 'commands' at the start of senderCompleted is { [0,74523] }, yet commands.rangesBegin()->end() at the end of the same method is "already" (74524+0). The printed range is inclusive, so end() is one past the last completed command (completing { [0,74522] } moved the confirmed point to 74523 in the log above); completing [0,74523] therefore confirms up to 74524, which is one command beyond what this broker believes it has sent.
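
For illustration, a minimal self-contained sketch of the check that fires here, using simplified stand-in types rather than the actual qpid classes; the numbers come from the log above, with train1's send point inferred.

#include <cstdint>
#include <iostream>
#include <sstream>
#include <stdexcept>

// Simplified stand-ins for qpid's SessionPoint / sender state.
struct SessionPoint { uint32_t command; };          // next command id (one past the last)
struct SenderState  { SessionPoint sendPoint; };    // one past the last command this broker sent

// Mirrors the guard in SessionState::senderConfirmed(): confirming past the
// send point is invalid, because it would acknowledge a command this session
// never sent.
void senderConfirmed(SenderState& sender, const SessionPoint& confirmed) {
    if (confirmed.command > sender.sendPoint.command) {
        std::ostringstream msg;
        msg << "confirmed < (" << confirmed.command << "+0) but only sent < ("
            << sender.sendPoint.command << "+0)";
        throw std::invalid_argument(msg.str());
    }
    std::cout << "sender confirmed point moved to (" << confirmed.command << "+0)\n";
}

int main() {
    // The broker that actually sent command 74523 has sendPoint 74524, so
    // confirming up to 74524 is fine.
    SenderState train1{ {74524} };
    senderConfirmed(train1, SessionPoint{74524});

    // The broker that never sent command 74523 still has sendPoint 74523;
    // the very same completion frame now throws.
    SenderState train2{ {74523} };
    try {
        senderConfirmed(train2, SessionPoint{74524});
    } catch (const std::invalid_argument& e) {
        std::cout << "exception: " << e.what() << "\n";
    }
}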

Also replicated with worker-threads=1, so a thread missing a lock is not behind the bug.


Further, I don't think bz 786418 is relevant: from tcpdumps it seems that a normal message sent to the QMF listener (notifying about a client (dis)connection event) is the conflicting one.

Comment 6 Pavel Moravec 2012-06-04 15:25:09 UTC
Created attachment 589192 [details]
trace logs with some notes

Attached are trace logs with my comments / observations:

"confirmed < (82395+0) but only sent < (82394+0)" error occurred on train2 node. As the node has not got an update from train1 about one QMF event sent from train1 to qpid-tool connected to train1.

The missing update happened some time in the past:

2012-06-04 13:22:57 debug Acquired message at 12818 from TinyRingQueue25

This line occurs on both nodes and notes that a message has been acquired from a full ring queue. On train1 (where the message was actually sent and where qpid-tool was connected), it caused:

debug SEND raiseEvent (v1) class=org.apache.qpid.broker.queueThresholdExceeded
debug SEND raiseEvent (v2) class=org.apache.qpid.broker.queueThresholdExceeded

But not on train2. These events were sent to qpid-tool from train1:

2012-06-04 13:22:57 trace anonymous.pmoravec-mrg-train1.16341.1: sent cmd 82295: header (88 bytes); properties={{MessageProperties: content-length=111; }{DeliveryProperties: routing-key=console.event.1.0.org.apache.qpid.broker.queueThresholdExceeded; }}

But the same cmd replayed on train2 is completely different (and in fact the above message has not been replayed at all). This message is then missing when train2 receives an ACK covering an interval bigger than the interval of messages it has actually sent.


Why didn't train2 raise the QMF events? Previous instances of the same event are replayed properly (see e.g. "2012-06-04 13:22:57 debug Acquired message at 12476 from TinyRingQueue2" in the logs).

Simply put, under some as-yet-unknown circumstances, the node replaying a message removal (from a ring queue due to the queue policy) does not raise the corresponding QMF event.

Comment 8 mick 2012-07-12 14:33:41 UTC
    executive summary
    -----------------
    we know the root cause, and at least two easy workarounds, but an actual fix could be serious, highly invasive work and will require a policy decision.



    root cause
    ----------
    ( thanks, gsim )

    In src/qpid/broker/ThresholdAlerts.cpp there is a quantity called qpid.alert_repeat_gap.  Its current default value is 60 seconds.  When some threshold is exceeded, like when a TinyRingQueue gets too full, this quantity prevents a storm of redundant warning messages from being generated.

    But this is a timer event that is independent of the CPG event-stream, and can thus occur at different points in that stream for two different clustered brokers.  Which is Bad.

    One broker got this event just *before* it decided to generate a warning message to qpid-tool, and one got it just *after*.  Boom.
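
A minimal sketch of that mechanism, with hypothetical names and a deliberately simplified model rather than the actual ThresholdAlerts code: each broker evaluates the repeat gap against its own wall clock while handling the same replicated enqueue, so two brokers whose clocks straddle the 60-second boundary make opposite decisions.

#include <cstdint>
#include <iostream>

// Hypothetical, simplified model of the repeat-gap suppression.  Each broker
// keeps only local state; nothing here goes through CPG.
struct ThresholdAlert {
    uint64_t repeatGapUsecs;    // qpid.alert_repeat_gap (default 60 s)
    uint64_t lastAlertUsecs;    // local wall-clock time of the last alert

    // Called while processing a replicated "queue exceeded its limit" event.
    // 'nowUsecs' is this broker's local clock, NOT part of the CPG stream.
    bool shouldRaise(uint64_t nowUsecs) {
        if (nowUsecs - lastAlertUsecs < repeatGapUsecs)
            return false;       // still inside the gap: suppress
        lastAlertUsecs = nowUsecs;
        return true;            // gap elapsed: raise queueThresholdExceeded
    }
};

int main() {
    const uint64_t SEC = 1000000;
    // Both brokers raised the previous alert at (locally measured) t = 0.
    ThresholdAlert train1{60 * SEC, 0};
    ThresholdAlert train2{60 * SEC, 0};

    // The same replicated enqueue is handled a split second apart in wall
    // time: train1's clock reads 60.000001 s, train2's reads 59.999999 s.
    std::cout << "train1 raises: " << train1.shouldRaise(60 * SEC + 1) << "\n";  // 1
    std::cout << "train2 raises: " << train2.shouldRaise(60 * SEC - 1) << "\n";  // 0

    // train1 now sends one extra QMF message to qpid-tool that train2 never
    // sends, so the two brokers' session send points diverge by one command.
}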



    possible workaround
    -------------------
    disable qpid.alert_repeat_gap in clustered brokers (perhaps also add a new mechanism to send only 1 warning message when a threshold is exceeded?)



    possible fix
    ------------
    in clustered brokers, send the alert_repeat_gap event out through CPG - do not act on it directly.

Comment 9 mick 2012-07-12 14:49:21 UTC
sorry, forgot to add this to the comment above:


evidence
---------

Before making any change to the code, on my system I used pmoravec's reproducer to reproduce the bug 5 times.  I checked 3 of them closely, and the mechanism was the same every time.

The average time-to-failure on my system was 39 minutes, with a standard deviation of 17 minutes.  

After making the code change (defaulting qpid.alert_repeat_gap to 0), my first test has not failed yet after more than 4 hours.

That is the mean-time-to-failure from my previous tests, plus 12 standard deviations.  Odds of this happening by chance are effectively zero.  ( something like 1 in 10^33 )

Comment 10 mick 2012-07-13 15:03:09 UTC
If we want to fix this, the right way to do it is to use Alan's ClusterTimer to tell the broker when it is OK to issue another warning.  That will force this event into the CPG event stream, and guarantee that the event is handled consistently across all brokers in the cluster.

Otherwise the qpid.alert_repeat_gap effectively introduces a timer event that is not in the CPG event stream, and thus may be executed at different CPG "times" in both brokers.
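
As an illustration only (hypothetical event names and interface, not Alan's actual ClusterTimer API), the point of the fix is that the "gap elapsed" decision itself becomes a replicated event, so every broker re-enables the alert at the same position in the CPG-delivered stream:

#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch: the timer owner multicasts "alert gap elapsed" through
// CPG instead of each broker consulting its local clock.  All brokers then see
// the re-enable event at the same position in the totally ordered stream, so
// the decision to raise or suppress an alert is identical everywhere.
enum class Event { QueueLimitExceeded, AlertGapElapsed };

struct Broker {
    std::string name;
    bool alertAllowed;          // deterministic: changed only by CPG events

    void deliver(Event e) {
        switch (e) {
        case Event::QueueLimitExceeded:
            if (alertAllowed) {
                std::cout << name << ": raise queueThresholdExceeded\n";
                alertAllowed = false;       // start the repeat gap
            }
            break;
        case Event::AlertGapElapsed:        // multicast by the timer owner
            alertAllowed = true;            // same stream position on every broker
            break;
        }
    }
};

int main() {
    Broker train1{"train1", true};
    Broker train2{"train2", true};
    // Every broker processes the identical, totally ordered CPG stream.
    std::vector<Event> cpgStream = {
        Event::QueueLimitExceeded,   // both raise the alert
        Event::QueueLimitExceeded,   // both suppress it
        Event::AlertGapElapsed,      // gap expiry is itself a replicated event
        Event::QueueLimitExceeded,   // both raise again - no divergence
    };
    for (Event e : cpgStream) { train1.deliver(e); train2.deliver(e); }
}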

Comment 11 mick 2012-07-25 18:12:57 UTC
Closing this as "won't fix" because we decided to just release-note it, and wait for new HA.

This is replaced by BZ 840059.

