Bug 510504 - clustered qpidd hang - cluster unresponsive (error Connection timed out: closing)
Summary: clustered qpidd hang - cluster unresponsive (error Connection timed out: closing)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.1.6
Assignee: Alan Conway
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-07-09 14:38 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:11 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-07-14 17:32:10 UTC
Target Upstream Version:
Embargoed:


Attachments
pstack and backtrace (10.09 KB, application/x-bzip2) - 2009-07-09 14:38 UTC, Frantisek Reznicek
log file from 'master' node (292.42 KB, application/x-compressed-tar) - 2009-07-10 10:58 UTC, Gordon Sim
Patch to fix the issue. (6.01 KB, patch) - 2009-07-10 15:51 UTC, Alan Conway
Patch based on 752581-25 (5.54 KB, patch) - 2009-07-10 16:27 UTC, Alan Conway


Links
Red Hat Product Errata RHBA-2009:1153 (normal, SHIPPED_LIVE): Red Hat Enterprise MRG Messaging bug fixing update - last updated 2009-07-14 17:31:48 UTC

Description Frantisek Reznicek 2009-07-09 14:38:14 UTC
Created attachment 351081 [details]
pstack and backtrace

Description of problem:
During validation of bug 506758 I triggered new behavior of the cluster on the -25 packages.
The cluster may become unresponsive to any client requests.
See the detailed call trace below.

Version-Release number of selected component (if applicable):
[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-25.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5

How reproducible:
90% (easily)

Steps to Reproduce:
1. run the https://bugzilla.redhat.com/attachment.cgi?id=351073 repro
   as './run.sh 5 75 kill' (a rough sketch of what the harness drives is shown below)
   (press Enter when you get "x", or remove the 'read x' line)
2. wait for the hang; the output looks like this:
  broker[s] running (pids:16122 16131 16417 16800 16964 , ports:5672 10001 10002 10003 10004 )
  Waiting for clients...
  75.75.75.75.75.75.75.75.75.75.K>R>75.75.75.75.75.75.75.75.75.75.K>R>75.
3.
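
For orientation, a rough bash sketch of the kind of loop the attached run.sh presumably drives; the attachment is authoritative, and the cluster name, broker options, and client command below are assumptions:

  #!/bin/bash
  # Sketch only: start a 5-node cluster, run 75 clients, then repeatedly
  # kill (the "K>" marker) and restart ("R>") one broker until the cluster hangs.
  CLUSTER=bz510504                          # hypothetical cluster name
  PORTS=(5672 10001 10002 10003 10004)

  for p in "${PORTS[@]}"; do
    qpidd --daemon --auth no --no-data-dir --port "$p" \
          --cluster-name "$CLUSTER" --log-enable debug+:cluster
  done

  # Placeholder for the 75 subscribe/publish clients attached to bug 506758.
  for i in $(seq 1 75); do
    ./client "amqp://localhost:5672" &
  done

  # Bounce one broker until the cluster stops responding.
  while true; do
    qpidd --quit --port 10001          # kill    (K>)
    sleep 2
    qpidd --daemon --auth no --no-data-dir --port 10001 \
          --cluster-name "$CLUSTER"    # restart (R>)
    sleep 5
  done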
  
Actual results:
The cluster becomes unresponsive.

Expected results:
The cluster should not become unresponsive.

Additional info (pstack and backtrace):

  transcript bzipped & attached
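
A minimal sketch of how such a transcript can be collected on the hung node; the pid selection and output file names are illustrative:

  # With several brokers on one host, pick the pid of the hung instance;
  # qpidc-debuginfo should be installed so that symbols resolve.
  PID=$(pidof qpidd | awk '{print $1}')
  pstack "$PID" > qpidd-pstack.txt
  gdb -p "$PID" -batch -ex 'thread apply all bt' > qpidd-backtrace.txt
  bzip2 qpidd-pstack.txt qpidd-backtrace.txt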

Comment 1 Gordon Sim 2009-07-10 10:40:00 UTC
I reproduced similar behaviour in a slightly different way, and I think we can rule out bug 494393 as the cause, since the cluster was successfully started and the master was never killed after that.

I used the subscribe program attached to bug 506758. I started 25 instances with only the master node in the url and 25 with all four nodes in the url. In each case the retry-interval was 0. All clients were started before the brokers; then the four cluster nodes were started and correct functioning of the cluster was observed.

I then bounced particular nodes at intervals and this (quite quickly) resulted in a hung cluster. The cluster remained unresponsive even after all clients were stopped and all but the master node were killed. In other words I believe the hang is unrecoverable except through a total cluster restart.

The pstack output for the master node appears normal, with all threads in Poller::wait() or Timer::run(); the log didn't contain any unusual messages.

Comment 2 Gordon Sim 2009-07-10 10:57:47 UTC
Reproduced with --log-enable debug+:cluster. Once hung, a failing connection attempt produces the following:

2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) new connection: 10.16.44.221:23777-60(local)
2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) add local connection 10.16.44.221:23777-60
2009-jul-10 06:51:23 debug 10.16.44.221:23777(READY/error) local close of replicated connection 10.16.44.221:23777-60(local)

Comment 3 Gordon Sim 2009-07-10 10:58:44 UTC
Created attachment 351238 [details]
log file from 'master' node

Comment 4 Alan Conway 2009-07-10 15:51:08 UTC
Created attachment 351276 [details]
Patch to fix the issue.

Also committed to trunk as r792991.

Comment 5 Alan Conway 2009-07-10 16:27:37 UTC
Created attachment 351279 [details]
Patch based on 752581-25

Comment 6 Frantisek Reznicek 2009-07-13 09:06:57 UTC
The issue has been fixed on RHEL 5.3 i386 / x86_64 with the following packages:
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-26.el5
qpidc-debuginfo-0.5.752581-26.el5
qpidc-devel-0.5.752581-26.el5
qpidc-perftest-0.5.752581-26.el5
qpidc-rdma-0.5.752581-26.el5
qpidc-ssl-0.5.752581-26.el5
qpidd-0.5.752581-26.el5
qpidd-acl-0.5.752581-26.el5
qpidd-cluster-0.5.752581-26.el5
qpidd-devel-0.5.752581-26.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-26.el5
qpidd-ssl-0.5.752581-26.el5
qpidd-xml-0.5.752581-26.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-9.el5
rhm-docs-0.5.756148-1.el5

->VERIFIED

Comment 8 errata-xmlrpc 2009-07-14 17:32:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html

