Bug 854666 - cluster initial update stall when a queue has >10k messages with message groups set
Status: NEW
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.1
Hardware: All Linux
Priority: high
Severity: high
Assigned To: Alan Conway
QA Contact: MRG Quality Engineering
Keywords: Patch
Depends On:
Blocks:
Reported: 2012-09-05 09:56 EDT by Pavel Moravec
Modified: 2015-11-11 04:12 EST (History)

Doc Type: Bug Fix
Type: Bug


Attachments
Patch proposal (11.65 KB, patch)
2012-09-24 11:26 EDT, Pavel Moravec


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Apache JIRA | QPID-4343 | None | None | None | 2012-09-24 11:29:55 EDT
Red Hat Knowledge Base (Solution) | 960903 | None | None | None | Never

Description Pavel Moravec 2012-09-05 09:56:34 EDT
Description of problem:
When qpid brokers run in a cluster and message groups are in use, a peer attempting to join the cluster stalls the cluster during the initial update if any queue holds more than 10k messages with message groups set.

The cause is that the updater node sends message group information in a ClusterConnectionQueueObserverStateBody message (exactly one message per queue). If a queue holds too many messages with message groups, that ClusterConnectionQueueObserverStateBody message does not fit into one AMQP frame and the updater silently(!) drops it.

The updatee node then waits for that message, while the updater node (and consequently the whole cluster) waits for the updatee to mark itself as ready.
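
To illustrate why a single per-queue state message grows until it no longer fits, here is a minimal sketch. It is not the broker's code: the GroupState type, the byte counts in estimateEncodedSize, and the kAssumedMaxEncodedSize limit are all assumptions chosen for illustration. The real encoding and the real frame limit differ. The sketch only shows that the size grows with both the number of messages and the length of the group names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-queue group state:
// group name -> positions of the messages that belong to that group.
using GroupState = std::map<std::string, std::vector<std::uint32_t>>;

// Assumed upper bound on one encoded state message. The real limit depends
// on the broker and the negotiated frame size; 65535 is illustrative only.
constexpr std::size_t kAssumedMaxEncodedSize = 65535;

// Rough size estimate under an assumed encoding: a 2-byte length prefix plus
// the group name, a 4-byte count, and 4 bytes per message position.
std::size_t estimateEncodedSize(const GroupState& state) {
    std::size_t total = 0;
    for (const auto& entry : state) {
        total += 2 + entry.first.size();
        total += 4 + 4 * entry.second.size();
    }
    return total;
}

// True if the whole queue state would fit into one message under the
// assumptions above.
bool fitsInOneMessage(const GroupState& state) {
    return estimateEncodedSize(state) <= kAssumedMaxEncodedSize;
}

// Builds a state with `messages` messages spread over groups of `groupSize`
// messages each; groups are named prefix + group index.
GroupState makeState(std::size_t messages, std::size_t groupSize,
                     const std::string& prefix) {
    assert(groupSize > 0);
    GroupState state;
    for (std::size_t i = 0; i < messages; ++i) {
        state[prefix + std::to_string(i / groupSize)]
            .push_back(static_cast<std::uint32_t>(i));
    }
    return state;
}
```

Under these assumed numbers a small queue fits easily, a queue with tens of thousands of grouped messages does not, and shorter group names reduce the estimate. A sender that checks fitsInOneMessage before encoding could at least report the failure instead of dropping the message silently.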


Version-Release number of selected component (if applicable):
0.14-21; almost certainly 0.18 as well


How reproducible:
100%


Steps to Reproduce:
1. Have a 2-node cluster with only 1 node running.
2. Send at least 10k messages with message groups to it:
qpid-send --group-key "GROUP_KEY" -m 10000 -a "groupQ; {create:always, node:{type:queue, x-declare:{ arguments:{'qpid.group_header_key':'GROUP_KEY', 'qpid.shared_msg_group':1 }}}}"
3. (Re)start the 2nd node twice. For an unknown reason, the first start succeeds while the second does not.


Actual results:
The new joiner stalls the cluster.


Expected results:
A broker joining a cluster must never stall the cluster.


Additional info:
Comment 1 Pavel Moravec 2012-09-05 10:38:46 EDT
FYI: 6000 messages are enough to trigger the bug in the above scenario. Resetting the group prefix with --group-prefix "" lets the same 6k messages pass in a single ClusterConnectionQueueObserverStateBody message.
Comment 3 Pavel Moravec 2012-09-24 11:26:30 EDT
Created attachment 616619
Patch proposal

Instead of having UpdateClient send one AMQP message that is too large to encode in order to update the state of MessageGroupManager, the patch sends several state updates, one per message group. Since a message group consists of only a few messages, this approach should no longer hit the original problem.

a/src/qpid/cluster/UpdateClient.cpp has to change so that it can send multiple updates for a single StatefulQueueObserver.

The change to a/src/qpid/broker/QueueFlowLimit.h is a direct consequence of that.

MessageGroupManager::getState and MessageGroupManager::setState in fact do the same as before, but without the "for (GroupMap::const_iterator .." loop done from UpdateClient.
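
The per-group approach described in this comment can be sketched roughly as follows. This is a simplified model under stated assumptions, not the attached patch: GroupState, GroupUpdate, splitByGroup and applyUpdate are hypothetical names, and the real code works with the broker's own state types rather than plain standard containers.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Positions = std::vector<std::uint32_t>;
// Hypothetical stand-in for the group state of one queue.
using GroupState = std::map<std::string, Positions>;

// One small update carrying the state of a single message group
// (hypothetical type, not part of the qpid API).
struct GroupUpdate {
    std::string group;
    Positions positions;
};

// Updater side: instead of one large update for the whole queue,
// produce one update per message group.
std::vector<GroupUpdate> splitByGroup(const GroupState& state) {
    std::vector<GroupUpdate> updates;
    updates.reserve(state.size());
    for (const auto& entry : state) {
        updates.push_back(GroupUpdate{entry.first, entry.second});
    }
    return updates;
}

// Updatee side: apply the updates one at a time. Each group is expected
// to arrive exactly once, so an existing entry indicates a logic error.
void applyUpdate(GroupState& state, const GroupUpdate& update) {
    assert(state.find(update.group) == state.end());
    state[update.group] = update.positions;
}
```

Applying every update from splitByGroup to an empty GroupState rebuilds the original state, while each individual update stays as small as its group. The trade-off is more messages on the wire during the initial update.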
Comment 4 Justin Ross 2013-02-22 10:51:02 EST
Involves the clustering implementation that is not present in 2.4/0.22.
