Bug 860701 - QMF queries for HA replication take too long to process
Summary: QMF queries for HA replication take too long to process
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.3
: ---
Assignee: Alan Conway
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
: 860413 (view as bug list)
Depends On:
Blocks: 698367
Reported: 2012-09-26 14:10 UTC by Jason Dillaman
Modified: 2013-03-19 16:39 UTC (History)

Fixed In Version: qpid-cpp-0.18-4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-19 16:39:29 UTC
Target Upstream Version:
Embargoed:


Attachments
Quick patch to greatly reduce lock contention within QMF (120.62 KB, patch)
2012-09-26 14:31 UTC, Jason Dillaman
Additional fixes to Jason's patch (131.76 KB, patch)
2012-10-01 11:04 UTC, Alan Conway


Links
System       ID         Private  Priority  Status  Summary  Last Updated
Apache JIRA  QPID-4286  0        None      None    None     2012-09-26 14:10:03 UTC

Description Jason Dillaman 2012-09-26 14:10:03 UTC
Description of problem:
In an HA broker with approximately 12,000 queues, it takes roughly 10-14 seconds for the first QMF queue query response fragment to arrive. While the QMF management agent is collecting the response, all other QMF-related functionality is blocked, which in turn blocks any thread that raises a QMF event.

As a result, clients get disconnected from the broker because worker threads are blocked by QMF (through missed heartbeats in an extreme case, or through the 2 second handshake timeout). The HA backup's federated link is also disconnected due to missed heartbeats when the link heartbeat interval is set to a low value.

If the HA backup loses its connection, the problem gets worse: the backup reconnects and re-issues the same QMF queries that made it lose its connection in the first place.

Version-Release number of selected component (if applicable):
Qpid 0.18

How reproducible:
Frequently

Steps to Reproduce:
1. Start up an HA primary and backup broker with a small link heartbeat interval (10 seconds)
2. Connect 6000 clients, each creating 2 replicated queues
3. Request a QMF queue reroute for all queues from a single client

Actual results:
All broker worker threads become temporarily blocked on ManagementAgent::userLock. The federated link then times out due to missed heartbeats, so the backup broker reconnects and re-issues its QMF queries, which are in turn blocked by userLock. This cycle of the backup reconnecting repeats.

Expected results:
QMF can gracefully handle the large number of events and requests.
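
For illustration only, here is a minimal sketch of one common way to shorten the time a single lock such as ManagementAgent::userLock is held: copy references to the managed objects under the lock, then build the slow query response with no lock held. This is an assumption about the general technique, not a description of the attached patches. Registry, QueueInfo, snapshot() and buildResponse() are hypothetical names and are not part of qpid-cpp.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a managed queue object; not a qpid-cpp type.
struct QueueInfo {
    std::string name;
    std::size_t depth;
};

// Hypothetical registry guarded by one mutex, playing the role that
// ManagementAgent::userLock plays in this report.
class Registry {
public:
    // Called by worker threads; blocks only while lock_ is held elsewhere.
    void add(const std::string& name, std::size_t depth) {
        std::lock_guard<std::mutex> guard(lock_);
        queues_[name] = std::make_shared<const QueueInfo>(QueueInfo{name, depth});
    }

    // Hold the lock only long enough to copy the shared pointers.
    std::vector<std::shared_ptr<const QueueInfo>> snapshot() const {
        std::lock_guard<std::mutex> guard(lock_);
        std::vector<std::shared_ptr<const QueueInfo>> out;
        out.reserve(queues_.size());
        for (const auto& entry : queues_) out.push_back(entry.second);
        return out;
    }

private:
    mutable std::mutex lock_;
    std::map<std::string, std::shared_ptr<const QueueInfo>> queues_;
};

// Build the (potentially slow) query response with no lock held, so
// threads calling add() are not stalled while the response is encoded.
std::string buildResponse(const Registry& registry) {
    std::string response;
    for (const auto& q : registry.snapshot()) {
        response += q->name + "=" + std::to_string(q->depth) + ";";
    }
    return response;
}
```

With this shape, a query over many queues holds the lock only for the pointer copy, and the per-queue encoding work no longer delays threads that need the same lock. The trade-off is that the response reflects the state at snapshot time rather than at send time.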

Comment 1 Jason Dillaman 2012-09-26 14:31:57 UTC
Created attachment 617570
Quick patch to greatly reduce lock contention within QMF

Comment 2 Alan Conway 2012-10-01 11:04:12 UTC
Created attachment 619735
Additional fixes to Jason's patch

This patch is Jason's patch plus some extra fixes that I think are necessary.

