860700 – QMF queries for HA replication take too long to process

Summary: QMF queries for HA replication take too long to process

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	Development
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	3.0
Target Release:	---
Assignee:	Alan Conway
QA Contact:	mick
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-09-26 14:09 UTC by Jason Dillaman
Modified:	2015-01-21 12:54 UTC
CC List:	4 users
Fixed In Version:	qpid-cpp-0.22-1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-01-21 12:54:47 UTC
Target Upstream Version:
Embargoed:

Attachments
reproducer (92.40 KB, application/x-gzip) 2013-08-26 19:10 UTC, mick

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Apache JIRA	QPID-4286	0	None	None	None	2012-09-26 14:09:52 UTC

Description Jason Dillaman 2012-09-26 14:09:52 UTC

Description of problem:
In an HA broker with approximately 12,000 queues, it takes roughly 10-14 seconds for the first QMF queue query response fragment to arrive. While the QMF management agent is collecting the response, all other QMF-related functionality is blocked, so any thread that raises a QMF event is blocked as well.

This causes clients to be disconnected from the broker because worker threads are blocked by QMF (through missed heartbeats in an extreme case, or through the 2 second handshake timeout). It also causes the HA backup's federated link to be disconnected due to missed heartbeats when the link heartbeat interval is set to a low value.

If the HA backup loses its connection, it only exacerbates the issue since it will reconnect and re-query the QMF data that made it lose its connection in the first place.

Version-Release number of selected component (if applicable):
Qpid 0.18

How reproducible:
Frequently

Steps to Reproduce:
1. Start up an HA primary and backup broker with a small link heartbeat interval (10 seconds)
2. Connect 6000 clients, each creating 2 replicated queues
3. Request a QMF queue reroute for all queues from a single client

Actual results:
All broker worker threads become temporarily blocked on ManagementAgent::userLock. The federated link then times out due to missed heartbeats, so the backup broker reconnects and re-issues its QMF queries (which are in turn blocked by userLock), and the cycle of backup reconnects repeats.
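
The blocking pattern described here can be illustrated with a toy model. This is only a sketch and not qpidd code: ToyAgent, user_lock, query_holding_lock, query_with_snapshot, raise_event and event_wait_during_query are hypothetical names, and the per-queue cost is simulated with a sleep. The snapshot variant shows one common way to shorten a critical section; it is not claimed to be the change that fixed this bug.

```python
import threading
import time


class ToyAgent:
    """Hypothetical stand-in for a management agent guarded by a single lock."""

    def __init__(self, queue_count, per_queue_cost):
        self.user_lock = threading.Lock()        # plays the role of userLock
        self.queues = ["q%d" % i for i in range(queue_count)]
        self.per_queue_cost = per_queue_cost     # simulated seconds per queue
        self.query_started = threading.Event()
        self.events = []

    def _describe(self, name):
        self.query_started.set()                 # the query is now in progress
        time.sleep(self.per_queue_cost)          # simulated work per queue
        return {"name": name}

    def query_holding_lock(self):
        # The whole response is built while the lock is held.
        with self.user_lock:
            return [self._describe(q) for q in self.queues]

    def query_with_snapshot(self):
        # Only the copy of the queue names happens under the lock.
        with self.user_lock:
            names = list(self.queues)
        return [self._describe(q) for q in names]

    def raise_event(self, event):
        # Event raisers need the same lock as the query does.
        with self.user_lock:
            self.events.append(event)


def event_wait_during_query(agent, query):
    """Seconds one raise_event call takes while `query` runs on another thread."""
    agent.query_started.clear()
    worker = threading.Thread(target=query)
    worker.start()
    agent.query_started.wait(timeout=5)          # wait until the query is running
    begin = time.monotonic()
    agent.raise_event("example-event")           # illustrative event name
    waited = time.monotonic() - begin
    worker.join()
    return waited


if __name__ == "__main__":
    agent = ToyAgent(queue_count=50, per_queue_cost=0.01)
    print("lock held for whole query: %.3f s" %
          event_wait_during_query(agent, agent.query_holding_lock))
    print("snapshot, then build reply: %.3f s" %
          event_wait_during_query(agent, agent.query_with_snapshot))
```

In the first variant the event raiser waits for roughly the remainder of the query, which is the behaviour the worker threads show in this report; in the second it only waits for the short copy under the lock.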

Expected results:
QMF can gracefully handle the large number of events and requests.

Comment 2 mick 2013-08-05 18:16:27 UTC

Jason -- Do you still have some code sitting around that I could use to see exactly what you did?   Especially for steps 2 & 3 of your "steps to reproduce".

Comment 3 Jason Dillaman 2013-08-05 18:23:55 UTC

I checked and unfortunately I do not have any reproducer available for this issue.  I encountered this issue indirectly while testing a heavily loaded HA broker (in terms of the number of connected clients and number of queues) and we no longer have access to the testing environment where we originally witnessed the issue.

Comment 4 mick 2013-08-26 19:10:32 UTC

Created attachment 790652
reproducer

One script, two cpp programs, and a README.

Comment 5 mick 2013-08-26 19:18:09 UTC

Please see my attachment for a big, mean reproducer.
And if you want to try it, look at the README first -- you might need to find a bigger box to run it on.

Verified!

Since this BZ was reported against development, I used the same set of packages for both the before-fix and after-fix tests. (See package list below.)
Only the qpid source tree changed. The qpidd broker I used in the before-fix tests was from that source build (svn version r1398529). My clients were also built against libraries from that source build.

In the after-fix tests I used qpidd from the installed packages below, and built my clients against the installed libraries.

The time measured before the fix was 240 seconds.
The time measured after the fix was 1.58 seconds, roughly a 150-fold reduction.

--> verified.

Package list:
cyrus-sasl-2.1.23-13.el6_3.1.x86_64
cyrus-sasl-devel-2.1.23-13.el6_3.1.x86_64
cyrus-sasl-gssapi-2.1.23-13.el6_3.1.x86_64
cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64
cyrus-sasl-md5-2.1.23-13.el6_3.1.x86_64
cyrus-sasl-plain-2.1.23-13.el6_3.1.x86_64
python-qpid-0.22-4.el6.noarch
python-qpid-qmf-0.22-7.el6.x86_64
python-saslwrapper-0.22-3.el6.x86_64
qpid-cpp-client-0.22-8.el6.x86_64
qpid-cpp-client-devel-0.22-8.el6.x86_64
qpid-cpp-client-devel-docs-0.22-8.el6.noarch
qpid-cpp-client-rdma-0.22-8.el6.x86_64
qpid-cpp-client-ssl-0.22-8.el6.x86_64
qpid-cpp-debuginfo-0.22-8.el6.x86_64
qpid-cpp-server-0.22-8.el6.x86_64
qpid-cpp-server-devel-0.22-8.el6.x86_64
qpid-cpp-server-ha-0.22-8.el6.x86_64
qpid-cpp-server-rdma-0.22-8.el6.x86_64
qpid-cpp-server-ssl-0.22-8.el6.x86_64
qpid-cpp-server-store-0.22-8.el6.x86_64
qpid-cpp-server-xml-0.22-8.el6.x86_64
qpid-cpp-tar-0.22-8.el6.noarch
qpid-java-client-0.22-5.el6.noarch
qpid-java-common-0.22-5.el6.noarch
qpid-java-example-0.22-5.el6.noarch
qpid-proton-c-0.4-2.2.el6.x86_64
qpid-proton-c-devel-0.4-2.2.el6.x86_64
qpid-proton-debuginfo-0.4-2.2.el6.x86_64
qpid-qmf-0.22-7.el6.x86_64
qpid-qmf-debuginfo-0.22-7.el6.x86_64
qpid-qmf-devel-0.22-7.el6.x86_64
qpid-snmpd-1.0.0-12.el6.x86_64
qpid-snmpd-debuginfo-1.0.0-12.el6.x86_64
qpid-tests-0.22-4.el6.noarch
qpid-tools-0.22-3.el6.noarch
saslwrapper-0.22-3.el6.x86_64
saslwrapper-devel-0.22-3.el6.x86_64
