Description of problem:
When a qpid broker cluster uses message groups, an attempt by a new peer to join the cluster stalls the cluster during the initial update if some queue holds more than 10k messages with message groups set.

The updater node sends the message group information in a ClusterConnectionQueueObserverStateBody message (exactly one message per queue). If a queue has too many messages with message groups, that ClusterConnectionQueueObserverStateBody message does not fit into one AMQP frame and the updater silently(!) drops it. The updatee node then waits for the message, while the updater node (and consequently the whole cluster) waits for the updatee to mark itself as ready.

Version-Release number of selected component (if applicable):
0.14-21, almost surely in 0.18

How reproducible:
100%

Steps to Reproduce:
1. Have a 2-node cluster with 1 node running.
2. Produce at least 10k messages with message groups to it:
   qpid-send --group-key "GROUP_KEY" -m 10000 -a "groupQ; {create:always, node:{type:queue, x-declare:{ arguments:{'qpid.group_header_key':'GROUP_KEY', 'qpid.shared_msg_group':1 }}}}"
3. (Re)start the 2nd node twice. For some unknown reason, the first start succeeds while the second does not.

Actual results:
The new joiner stalls the cluster.

Expected results:
No broker joining a cluster can stall the cluster.

Additional info:
(FYI: sending 6000 messages in the above scenario is enough to trigger the bug, whereas resetting the group prefix with --group-prefix "" lets 6k messages pass in a single ClusterConnectionQueueObserverStateBody message.)
Created attachment 616619
Patch proposal

Instead of sending one AMQP message that is too large to encode from UpdateClient to update the state of MessageGroupManager, several state updates are sent, one per message group. Since a message group consists of only a few messages, this approach should no longer hit the original problem.

a/src/qpid/cluster/UpdateClient.cpp has to change so that one StatefulQueueObserver can send potentially several updates. The change to a/src/qpid/broker/QueueFlowLimit.h is a direct consequence of that. MessageGroupManager::getState and MessageGroupManager::setState in fact do the same as before, but without the "for (GroupMap::const_iterator .." loop, which is now done from UpdateClient.
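
To illustrate why splitting the update helps, here is a minimal standalone sketch. It is not the attached patch and does not use any qpid code: the GroupMap typedef, the helper functions, the 4-bytes-per-position size estimate and the 65535-byte budget are all assumptions made for illustration. It only compares the rough size of one combined update against the sizes of per-group updates.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the group state: group name -> queue positions
// of the messages in that group. Not the real MessageGroupManager layout.
typedef std::map<std::string, std::vector<unsigned> > GroupMap;

// Assumed size budget for one state update, in bytes (illustrative only).
const std::size_t kMaxUpdateBytes = 65535;

// Rough size estimate for one group: its name plus 4 bytes per position.
// The real wire encoding is not modeled here.
std::size_t encodedSize(const std::string& name,
                        const std::vector<unsigned>& positions) {
    return name.size() + 4 * positions.size();
}

// Original approach: a single update carrying every group of the queue.
std::size_t singleUpdateSize(const GroupMap& groups) {
    std::size_t total = 0;
    for (GroupMap::const_iterator i = groups.begin(); i != groups.end(); ++i)
        total += encodedSize(i->first, i->second);
    return total;
}

// Proposed approach: one update per group; returns the size of each update.
std::vector<std::size_t> perGroupUpdateSizes(const GroupMap& groups) {
    std::vector<std::size_t> sizes;
    for (GroupMap::const_iterator i = groups.begin(); i != groups.end(); ++i)
        sizes.push_back(encodedSize(i->first, i->second));
    return sizes;
}

// True if an update of the given size stays within the assumed budget.
bool fits(std::size_t bytes) { return bytes <= kMaxUpdateBytes; }

// Builds test data: messageCount messages spread round-robin over
// groupCount groups named "group-0", "group-1", ...
GroupMap makeGroups(unsigned messageCount, unsigned groupCount) {
    assert(groupCount > 0);
    GroupMap groups;
    for (unsigned pos = 0; pos < messageCount; ++pos) {
        std::ostringstream name;
        name << "group-" << (pos % groupCount);
        groups[name.str()].push_back(pos);
    }
    return groups;
}
```

With makeGroups(20000, 5000), for example, singleUpdateSize exceeds the assumed budget while every entry returned by perGroupUpdateSizes stays far below it. A single group holding a very large number of messages would still exceed the budget in this model, so the per-group split relies on groups staying small.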
Involves clustering impl not present in 2.4/0.22.