Bug 854666 - cluster initial update stall when a queue has >10k messages with message groups set
Summary: cluster initial update stall when a queue has >10k messages with message grou...
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: messaging-bugs
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-09-05 13:56 UTC by Pavel Moravec
Modified: 2024-01-19 19:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments
Patch proposal (11.65 KB, patch)
2012-09-24 15:26 UTC, Pavel Moravec


Links
System                             ID         Private  Priority  Status  Summary  Last Updated
Apache JIRA                        QPID-4343  0        None      None    None     2012-09-24 15:29:55 UTC
Red Hat Knowledge Base (Solution)  960903     0        None      None    None     Never

Description Pavel Moravec 2012-09-05 13:56:34 UTC
Description of problem:
With a qpid broker running in a cluster and message groups in use, a new peer joining the cluster stalls the cluster during the initial update process whenever some queue holds >10k messages with message groups set.

The reason is that the updater node sends the message group information in a ClusterConnectionQueueObserverStateBody message (exactly one message per queue). If a queue holds "too many" messages with message groups set, this ClusterConnectionQueueObserverStateBody message does not fit into one AMQP frame and is silently(!) dropped by the updater.

The updatee node then waits for the message, while the updater node (and consequently the whole cluster) waits for the updatee to mark itself as ready.
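The size problem described above can be illustrated with a minimal sketch. It is not the qpid code: GroupState, the byte counts in groupEncodedSize, the kMaxFrameBytes limit and the makeGroups helper are all assumptions made for illustration, so the sketch does not reproduce the exact message counts seen in this bug. It only shows that a single per-queue state message grows with the number of groups and the length of their names, and that an explicit size check could raise an error instead of dropping the message silently.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model of one message group's replicated state (not a qpid type).
struct GroupState {
    std::string name;                 // group identifier, e.g. prefix + counter
    std::vector<unsigned> positions;  // queue positions of the group's messages
};

// Assumed upper bound on what fits into one frame; illustrative value only.
const std::size_t kMaxFrameBytes = 65535;

// Invented encoding: 2-byte length + name, 4-byte count, 4 bytes per position.
std::size_t groupEncodedSize(const GroupState& g) {
    return 2 + g.name.size() + 4 + 4 * g.positions.size();
}

// Size of ONE per-queue state message that carries every group of the queue.
std::size_t queueStateEncodedSize(const std::vector<GroupState>& groups) {
    std::size_t total = 4;  // 4-byte group count
    for (std::size_t i = 0; i < groups.size(); ++i)
        total += groupEncodedSize(groups[i]);
    return total;
}

// Fail loudly when the state does not fit, rather than dropping it silently.
void checkFitsInFrame(const std::vector<GroupState>& groups) {
    const std::size_t size = queueStateEncodedSize(groups);
    if (size > kMaxFrameBytes)
        throw std::length_error("queue observer state needs " +
                                std::to_string(size) + " bytes, exceeds one frame");
}

// Helper for experiments: spread `messages` messages over groups of `groupSize`.
std::vector<GroupState> makeGroups(const std::string& prefix,
                                   unsigned messages, unsigned groupSize) {
    assert(groupSize > 0);
    std::vector<GroupState> groups;
    for (unsigned pos = 0; pos < messages; ++pos) {
        if (pos % groupSize == 0) {
            GroupState g;
            g.name = prefix + std::to_string(pos / groupSize);
            groups.push_back(g);
        }
        groups.back().positions.push_back(pos);
    }
    return groups;
}
```

In this model, calling checkFitsInFrame(makeGroups("GROUP-", 20000, 10)) throws, while the same call for a few hundred messages returns normally. A shorter prefix reduces the encoded size for the same number of messages.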


Version-Release number of selected component (if applicable):
0.14-21; almost certainly also present in 0.18


How reproducible:
100%


Steps to Reproduce:
1. Set up a 2-node cluster with only 1 node running.
2. Send at least 10k messages with message groups to it:
qpid-send --group-key "GROUP_KEY" -m 10000 -a "groupQ; {create:always, node:{type:queue, x-declare:{ arguments:{'qpid.group_header_key':'GROUP_KEY', 'qpid.shared_msg_group':1 }}}}"
3. (Re)start the 2nd node twice. For an unknown reason, the first start succeeds while the second does not.


Actual results:
New joiner stalls the cluster.


Expected results:
No broker joining a cluster can stall the cluster.


Additional info:

Comment 1 Pavel Moravec 2012-09-05 14:38:46 UTC
(FYI: 6000 messages are enough to trigger the bug in the above scenario, whereas clearing the group prefix with --group-prefix "" lets the state for 6k messages fit into a single ClusterConnectionQueueObserverStateBody message.)

Comment 3 Pavel Moravec 2012-09-24 15:26:30 UTC
Created attachment 616619
Patch proposal

Patch proposal.

Instead of sending one AMQP message from UpdateClient that is too large to encode in order to update the state of MessageGroupManager, several state updates are sent, one per message group. As a message group consists of only a few messages, this approach should no longer hit the original problem.

a/src/qpid/cluster/UpdateClient.cpp has to be changed so that it can send more than one update per StatefulQueueObserver.

The change to a/src/qpid/broker/QueueFlowLimit.h is a direct consequence of that.

MessageGroupManager::getState and MessageGroupManager::setState in fact do the same as before, but without the "for (GroupMap::const_iterator .." loop, which is now done from UpdateClient.
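The idea of the patch can be sketched roughly as follows. This is not the attached patch: StateUpdate, getStateUpdates, applyStateUpdate and updateEncodedSize are hypothetical names, the GroupMap here is a simplified stand-in for the real one, and the byte counts are invented. The sketch only shows the sender iterating over the groups and emitting one small update per group, and the receiver rebuilding the full state by applying the updates one at a time.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for the group map: group name -> message positions.
typedef std::map<std::string, std::vector<unsigned> > GroupMap;

// Hypothetical update message that carries the state of a single group.
struct StateUpdate {
    std::string group;
    std::vector<unsigned> positions;
};

// Sender side: one update per group instead of one update per queue.
std::vector<StateUpdate> getStateUpdates(const GroupMap& groups) {
    std::vector<StateUpdate> updates;
    for (GroupMap::const_iterator it = groups.begin(); it != groups.end(); ++it) {
        StateUpdate u;
        u.group = it->first;
        u.positions = it->second;
        updates.push_back(u);
    }
    return updates;
}

// Receiver side: merge one group's update into the state being rebuilt.
void applyStateUpdate(GroupMap& groups, const StateUpdate& update) {
    groups[update.group] = update.positions;
}

// Invented encoding: 2-byte length + name, 4-byte count, 4 bytes per position.
std::size_t updateEncodedSize(const StateUpdate& u) {
    return 2 + u.group.size() + 4 + 4 * u.positions.size();
}
```

With this split, the size of each update depends only on the size of one group, not on the total number of messages in the queue. A single very large group could still produce an oversized update, so a size check on each update remains worthwhile.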

Comment 4 Justin Ross 2013-02-22 15:51:02 UTC
Involves clustering impl not present in 2.4/0.22.

