Bug 507388 - Cluster becomes unresponsive after an error.
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.1.6
Target Release: ---
Assigned To: Alan Conway
QA Contact: Jan Sarenik
Depends On:
Blocks:
Reported: 2009-06-22 11:43 EDT by Alan Conway
Modified: 2009-07-14 13:32 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Last Closed: 2009-07-14 13:32:07 EDT

Attachments
Patch to address the issue by reducing log output (6.12 KB, patch)
2009-06-22 13:53 EDT, Alan Conway
reproducer script (224 bytes, application/x-sh)
2009-07-03 09:17 EDT, Jan Sarenik

Description Alan Conway 2009-06-22 11:43:04 EDT
Description of problem:

If a busy producer session is disconnected due to an error, the cluster becomes unresponsive for a long period of time (up to half an hour has been reported).

Version-Release number of selected component (if applicable):

This has been observed in -19, -17 and unpatched r752581 builds.

How reproducible:

Easy

Steps to Reproduce:
1. Run cluster with message store.
2. perftest --durable=true --size=8
  
Actual results:

After a few seconds the client reports a queue capacity error. The brokers repeatedly log a "session-not-attached" error and are unresponsive.

Expected results:

The brokers should handle the error quickly and remain responsive.

Additional info:

It looks like the cluster's flow control mechanism is not working, so thousands of frames are queued up before the error occurs. When the session is detached by the queue capacity error, the brokers have to work through the backlog of frames, rejecting each one.

Inserting log statements in cluster::Connection::decode and giveReadCredit shows long runs (of up to 50) of 64k reads being made with no credit.
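
For illustration only, the behaviour one would expect from a working credit scheme can be sketched as below. This is a minimal sketch and not the qpid-cpp implementation: the class ReadCredit and the helper admittedReads are hypothetical names, and the sketch assumes one unit of credit per read. With correct flow control, a run of 50 reads with no credit being returned should admit only as many reads as there is credit outstanding.

```cpp
#include <cstddef>

// Hypothetical credit gate; names are illustrative and not taken from qpid-cpp.
class ReadCredit {
  public:
    explicit ReadCredit(std::size_t initial) : credit_(initial) {}

    // Called when an earlier read has been processed and another may be issued.
    void give(std::size_t n = 1) { credit_ += n; }

    // Called before issuing a read. Returns false when no credit is left,
    // in which case the caller should stop reading until give() is called.
    bool tryTake() {
        if (credit_ == 0) return false;
        --credit_;
        return true;
    }

    std::size_t available() const { return credit_; }

  private:
    std::size_t credit_;
};

// Counts how many of `attempts` consecutive reads are admitted when no
// credit is returned in between.
inline std::size_t admittedReads(ReadCredit& credit, std::size_t attempts) {
    std::size_t admitted = 0;
    for (std::size_t i = 0; i < attempts; ++i) {
        if (credit.tryTake()) ++admitted;
    }
    return admitted;
}
```

In this sketch a gate created with two units of credit admits two of 50 attempted reads; the remaining reads wait until give() is called, which is what bounds the backlog.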
Comment 1 Alan Conway 2009-06-22 13:53:28 EDT
Created attachment 348952
Patch to address the issue by reducing log output

The broker unresponsiveness is caused by the cost of logging a not-attached error for each frame. This patch makes the broker log only the first not-attached error, which resolves the observed problem.

There still appears to be an underlying issue with broker flow control; see bug 507421.
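
The log-once idea described in this comment can be sketched roughly as follows. The attached patch is not reproduced here, so this is an assumed illustration rather than the actual change: DetachedSession, rejectFrame and rejectMany are hypothetical names, and a std::vector stands in for the broker log.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a session that has been detached;
// not the qpid-cpp class touched by the patch.
class DetachedSession {
  public:
    // Rejects a frame that arrives after the session was detached.
    // Only the first rejection is logged; later ones are just counted.
    void rejectFrame() {
        ++rejected_;
        if (!logged_) {
            logged_ = true;
            log_.push_back("session-not-attached");
        }
    }

    std::size_t rejected() const { return rejected_; }
    const std::vector<std::string>& logLines() const { return log_; }

  private:
    bool logged_ = false;
    std::size_t rejected_ = 0;
    std::vector<std::string> log_;  // stand-in for the broker log
};

// Drains a backlog of `frames` queued frames through the detached session.
inline void rejectMany(DetachedSession& session, std::size_t frames) {
    for (std::size_t i = 0; i < frames; ++i) session.rejectFrame();
}
```

Every queued frame is still rejected, but the per-frame cost no longer includes a log write, so draining a backlog of thousands of frames produces a single log line.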
Comment 2 Gordon Sim 2009-06-23 05:08:54 EDT
Fixed in qpidd-0.5.752581-20.
Comment 3 Jan Sarenik 2009-07-03 04:26:41 EDT
Reproduced on qpidd-0.5.752581-17.el5 / RHEL5 i386
Comment 4 Jan Sarenik 2009-07-03 04:44:42 EDT
Verified on qpidd-0.5.752581-22.el5, RHEL5 i386 and x86_64

Versions of related packages:
  qpid* is 0.5.752581-22
  rhm-0.5.3206-5.el5
  openais-0.80.3-22.el5_3.8
Comment 5 Jan Sarenik 2009-07-03 09:17:20 EDT
Created attachment 350430
reproducer script

Contains only what is already written above.
Comment 7 errata-xmlrpc 2009-07-14 13:32:07 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html
