Bug 507388

Summary: Cluster becomes unresponsive after an error.
Product: Red Hat Enterprise MRG
Reporter: Alan Conway <aconway>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED ERRATA
QA Contact: Jan Sarenik <jsarenik>
Severity: high
Priority: high
Version: 1.1.1
CC: gsim, jsarenik
Target Milestone: 1.1.6
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-07-14 17:32:07 UTC
Attachments:
  Patch to address the issue by reducing log output (flags: none)
  reproducer script (flags: none)

Description Alan Conway 2009-06-22 15:43:04 UTC
Description of problem:

If a busy producer session is disconnected due to an error, the cluster becomes unresponsive for a long period of time (up to half an hour has been reported).

Version-Release number of selected component (if applicable):

This has been observed in -19, -17 and unpatched r752581 builds.

How reproducible:

Easy

Steps to Reproduce:
1. Run cluster with message store.
2. Run perftest --durable=true --size=8
  
Actual results:

After a few seconds the client reports a queue capacity error. The brokers repeatedly log a "session-not-attached" error and are unresponsive.

Expected results:

The brokers should handle the error quickly and remain responsive.

Additional info:

It looks like the cluster's flow control mechanism is not working and thousands of frames are being queued up prior to the error. When the session is detached by the queue capacity error, the brokers have to work through the backlog of frames, rejecting each one.

Inserting log statements in cluster::Connection::decode and giveReadCredit shows that long runs (up to 50) of 64k reads are being made with no credit.
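
For illustration only, here is a minimal sketch of what credit-gated reading is expected to look like: a read is issued only while credit is available, and credit is returned once earlier data has been processed. ReadCredit and readsAllowed are hypothetical names made up for this sketch; this is not the actual cluster::Connection code, and the real giveReadCredit logic may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical credit counter, NOT the qpid cluster::Connection API.
class ReadCredit {
  public:
    explicit ReadCredit(std::size_t initial) : credit_(initial) {}

    // Return credit once previously read data has been processed.
    void give(std::size_t n) {
        assert(n > 0);  // giving zero credit would be a caller bug
        credit_ += n;
    }

    // Called before each read; returns false when the reader must wait.
    bool tryTake() {
        if (credit_ == 0) return false;
        --credit_;
        return true;
    }

    std::size_t available() const { return credit_; }

  private:
    std::size_t credit_;
};

// Count how many of `pending` reads are let through before credit runs out.
std::size_t readsAllowed(ReadCredit& credit, std::size_t pending) {
    std::size_t done = 0;
    while (done < pending && credit.tryTake()) ++done;
    return done;
}
```

With working flow control, a burst of 50 pending reads against a credit of 2 would let only 2 reads through until more credit is given. The logs above suggest the brokers are instead performing the whole run.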

Comment 1 Alan Conway 2009-06-22 17:53:28 UTC
Created attachment 348952 [details]
Patch to address the issue by reducing log output

The broker unresponsiveness is caused by the cost of logging a not-attached error for each frame. This patch makes the broker log only the first not-attached error, which resolves the observed problem.

There still appears to be an underlying issue with broker flow control; see bug 507421.
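
A rough sketch of the log-once idea, for readers without access to the attachment. The class and function names (NotAttachedReporter, logForRejectedFrames) and the message text are assumptions made for this example; they are not taken from the attached patch or from the qpid sources.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: logs only the first not-attached error for a session
// and silently counts the rest.
class NotAttachedReporter {
  public:
    explicit NotAttachedReporter(std::ostream& log)
        : log_(log), logged_(false), rejected_(0) {}

    // Called once for every frame rejected on the detached session.
    void frameRejected(const std::string& session) {
        ++rejected_;
        if (logged_) return;  // skip the expensive log call after the first error
        logged_ = true;
        log_ << "error: session-not-attached: " << session << '\n';
    }

    std::size_t rejected() const { return rejected_; }

  private:
    std::ostream& log_;
    bool logged_;
    std::size_t rejected_;
};

// Reject `frames` frames on one session and return the resulting log text.
std::string logForRejectedFrames(std::size_t frames) {
    std::ostringstream log;
    NotAttachedReporter reporter(log);
    for (std::size_t i = 0; i < frames; ++i) reporter.frameRejected("session-1");
    return log.str();
}
```

However many frames are in the backlog, the log receives a single line, so the cost of rejecting each frame no longer includes a log write.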

Comment 2 Gordon Sim 2009-06-23 09:08:54 UTC
Fixed in qpidd-0.5.752581-20.

Comment 3 Jan Sarenik 2009-07-03 08:26:41 UTC
Reproduced on qpidd-0.5.752581-17.el5 / RHEL5 i386

Comment 4 Jan Sarenik 2009-07-03 08:44:42 UTC
Verified on qpidd-0.5.752581-22.el5, RHEL5 i386 and x86_64

Versions of related packages:
  qpid* is 0.5.752581-22
  rhm-0.5.3206-5.el5
  openais-0.80.3-22.el5_3.8

Comment 5 Jan Sarenik 2009-07-03 13:17:20 UTC
Created attachment 350430 [details]
reproducer script

contains merely what is already written above

Comment 7 errata-xmlrpc 2009-07-14 17:32:07 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html