Red Hat Bugzilla – Bug 507388
Cluster becomes unresponsive after an error.
Last modified: 2009-07-14 13:32:07 EDT
Description of problem:
If a busy producer session is disconnected due to an error, the cluster becomes unresponsive for a long period of time (up to half an hour has been reported).
Version-Release number of selected component (if applicable):
This has been observed in -19, -17 and unpatched r752581 builds.
Steps to Reproduce:
1. Run cluster with message store.
2. perftest --durable=true --size=8
After a few seconds the client reports a queue capacity error. The brokers repeatedly log a "session-not-attached" error in their logs and are unresponsive.
The brokers should handle the error quickly and remain responsive.
It looks like the cluster's flow control mechanism is not working and thousands of frames are queued up prior to the error. When the session is detached by the queue capacity error, the brokers have to work through the backlog of frames, rejecting each one.
Inserting log statements in cluster::Connection::decode and giveReadCredit shows long runs (of up to 50) of 64k reads being made with no credit.
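The credit check being probed above can be sketched as follows. This is a hypothetical minimal class, not the actual qpid cluster code: each 64k read is assumed to consume one unit of credit, and a read attempted with zero credit corresponds to the "reads with no credit" runs observed in the logs; giving credit back mirrors what giveReadCredit is expected to do.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of credit-based read flow control (assumption:
// one unit of credit per 64k read, as suggested by the log probes).
class ReadCredit {
public:
    explicit ReadCredit(std::size_t initial) : credit_(initial) {}

    // Called before issuing a read; returns false when no credit
    // remains -- the condition seen in the long runs of 64k reads.
    bool tryConsume() {
        if (credit_ == 0) return false;
        --credit_;
        return true;
    }

    // Mirrors the role of giveReadCredit: replenish credit once the
    // decoded frames have been processed downstream.
    void give(std::size_t n) { credit_ += n; }

    std::size_t available() const { return credit_; }

private:
    std::size_t credit_;
};
```

With correct flow control, tryConsume would return false before a large backlog of frames could build up; the bug suggests reads continued even after credit was exhausted.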
Created attachment 348952 [details]
Patch to address the issue by reducing log output
The broker unresponsiveness is caused by the cost of logging a non-attached error for each frame. This patch makes the broker log only the first not-attached error and resolves the observed problem.
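The patch's approach of logging only the first occurrence can be sketched with a simple guard. This is an illustrative assumption of how such suppression might look, not the attached patch itself: the first not-attached error passes the guard and is logged, and subsequent errors for the same backlog are suppressed, so rejecting thousands of queued frames no longer pays the logging cost per frame.

```cpp
#include <atomic>

// Hypothetical log-once guard: returns true for the first error only,
// suppressing the per-frame "session-not-attached" log spam described
// in this bug.
class LogOnce {
public:
    // True only on the first call since construction or reset().
    bool shouldLog() { return !logged_.exchange(true); }

    // Re-arm the guard, e.g. once the error condition has cleared.
    void reset() { logged_.store(false); }

private:
    std::atomic<bool> logged_{false};
};

// Usage sketch (names are illustrative):
//   if (notAttachedGuard.shouldLog())
//       log(error, "session-not-attached: " + detail);
```

Using an atomic exchange keeps the guard safe even if frames are rejected from more than one thread.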
There still appears to be an underlying issue with broker flow control, see bug 507421
Fixed in qpidd-0.5.752581-20.
Reproduced on qpidd-0.5.752581-17.el5 / RHEL5 i386
Verified on qpidd-0.5.752581-22.el5, RHEL5 i386 and x86_64
Versions of related packages:
qpid* is 0.5.752581-22
Created attachment 350430 [details]
Contains only what is already written above.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.