Bug 507388

Summary:

Cluster becomes unresponsive after an error.

Product:

Red Hat Enterprise MRG

Reporter:

Alan Conway <aconway>

Component:

qpid-cpp

Assignee:

Alan Conway <aconway>

Status:

CLOSED ERRATA

QA Contact:

Jan Sarenik <jsarenik>

Severity:

high

Docs Contact:

Priority:

high

Version:

1.1.1

CC:

gsim, jsarenik

Target Milestone:

1.1.6

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2009-07-14 17:32:07 UTC

Attachments:

Description	Flags
Patch to address the issue by reducing log output	none
reproducer script	none

Description Alan Conway 2009-06-22 15:43:04 UTC

Description of problem:

If a busy producer session is disconnected due to an error, the cluster becomes unresponsive for a long period of time (up to half an hour has been reported).

Version-Release number of selected component (if applicable):

This has been observed in -19, -17 and unpatched r752581 builds.

How reproducible:

Easy

Steps to Reproduce:
1. Run a cluster with a message store.
2. perftest --durable=true --size=8
  
Actual results:

After a few seconds the client reports a queue capacity error. The brokers repeatedly log a "session-not-attached" error and are unresponsive.

Expected results:

The brokers should handle the error quickly and remain responsive.

Additional info:

It looks like the cluster's flow control mechanism is not working and thousands of frames are being queued up prior to the error. When the session is detached by the queue capacity error, the brokers have to work through the backlog of frames, rejecting each one.

Log statements inserted in cluster::Connection::decode and giveReadCredit show long runs (up to 50) of 64k reads being made with no credit.
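
For illustration only, the behaviour one would expect from credit-based read flow control can be sketched as below. This is a minimal sketch and not the qpid-cpp implementation: the class CreditGatedReader and its members are hypothetical names, and the report does not show the real semantics of giveReadCredit. The point is only that a read should be refused while no credit is held, so a backlog cannot build up unchecked.

```cpp
#include <cstddef>

// Hypothetical sketch (not qpid-cpp code): a reader that may only perform
// a read while it holds credit.
class CreditGatedReader {
  public:
    // Grant permission for n further reads.
    void giveCredit(std::size_t n) { credit_ += n; }

    // Attempt one read. Consumes one credit and returns true if credit is
    // available; otherwise counts the refusal and returns false.
    bool tryRead() {
        if (credit_ == 0) {
            ++refused_;
            return false;
        }
        --credit_;
        ++reads_;
        return true;
    }

    std::size_t reads() const { return reads_; }
    std::size_t refused() const { return refused_; }

  private:
    std::size_t credit_ = 0;   // reads currently permitted
    std::size_t reads_ = 0;    // reads that were allowed
    std::size_t refused_ = 0;  // reads attempted with no credit
};
```

With such a gate, the runs of reads with no credit described above would show up as refusals rather than as queued frames.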

Comment 1 Alan Conway 2009-06-22 17:53:28 UTC

Created attachment 348952
Patch to address the issue by reducing log output

The broker unresponsiveness is caused by the cost of logging a not-attached error for each frame. This patch makes the broker log only the first not-attached error and resolves the observed problem.

There still appears to be an underlying issue with broker flow control; see bug 507421.
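
The log-once approach described in this comment can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the attached patch: DetachedSession, rejectFrame and the in-memory log vector are hypothetical stand-ins for the broker's session and logging code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a session that has been detached. Every frame
// arriving for it is rejected, but only the first rejection is logged.
class DetachedSession {
  public:
    // The log sink is a plain vector here purely for illustration.
    explicit DetachedSession(std::vector<std::string>& log) : log_(log) {}

    // Reject one frame; emit the not-attached message only once.
    void rejectFrame() {
        ++rejected_;
        if (!logged_) {
            log_.push_back("session-not-attached");
            logged_ = true;
        }
    }

    std::size_t rejected() const { return rejected_; }

  private:
    std::vector<std::string>& log_;
    bool logged_ = false;       // true once the error has been logged
    std::size_t rejected_ = 0;  // total frames rejected
};
```

Rejecting a backlog of thousands of frames then costs one log write instead of one per frame, which is the saving the patch relies on.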

Comment 2 Gordon Sim 2009-06-23 09:08:54 UTC

Fixed in qpidd-0.5.752581-20.

Comment 3 Jan Sarenik 2009-07-03 08:26:41 UTC

Reproduced on qpidd-0.5.752581-17.el5 / RHEL5 i386

Comment 4 Jan Sarenik 2009-07-03 08:44:42 UTC

Verified on qpidd-0.5.752581-22.el5, RHEL5 i386 and x86_64

Versions of related packages:
  qpid* is 0.5.752581-22
  rhm-0.5.3206-5.el5
  openais-0.80.3-22.el5_3.8

Comment 5 Jan Sarenik 2009-07-03 13:17:20 UTC

Created attachment 350430
reproducer script

Contains only the steps already described above.

Comment 7 errata-xmlrpc 2009-07-14 17:32:07 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html