Bug 510504

Summary:

clustered qpidd hang - cluster unresponsive (error Connection timed out: closing)

Product:

Red Hat Enterprise MRG

Reporter:

Frantisek Reznicek <freznice>

Component:

qpid-cpp

Assignee:

Alan Conway <aconway>

Status:

CLOSED ERRATA

QA Contact:

MRG Quality Engineering <mrgqe-bugs>

Severity:

high

Priority:

high

Version:

1.1.2

CC:

cctrieloff, esammons, gsim

Target Milestone:

1.1.6

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-07-14 17:32:10 UTC

Attachments:

Description	Flags
pstack and backtrace	none
log file from 'master' node	none
Patch to fix the issue.	none
Patch based on 752581-25	none

Description Frantisek Reznicek 2009-07-09 14:38:14 UTC

Created attachment 351081
pstack and backtrace

Description of problem:
While validating bug 506758, I triggered new cluster behavior on the -25 packages:
the cluster may become unresponsive to all client requests.
See the detailed call trace below.

Version-Release number of selected component (if applicable):
[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-25.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5

How reproducible:
90% (easily)

Steps to Reproduce:
1. run the https://bugzilla.redhat.com/attachment.cgi?id=351073 repro
   as './run.sh 5 75 kill'
   (press Enter when you get "x", or remove the 'read x' line)
2. wait for hang, looks like this:
  broker[s] running (pids:16122 16131 16417 16800 16964 , ports:5672 10001   10002 10003 10004 )
  Waiting for clients...
  75.75.75.75.75.75.75.75.75.75.K>R>75.75.75.75.75.75.75.75.75.75.K>R>75.     
  
Actual results:
Cluster becomes unresponsive.

Expected results:
Cluster should not become unresponsive.

Additional info (pstack and backtrace):

  transcript bzipped & attached

Comment 1 Gordon Sim 2009-07-10 10:40:00 UTC

I reproduced similar behaviour in a slightly different way, and I think I can rule out bug 494393 as the cause: the cluster started successfully and the master was never killed after that.

I used the subscribe program attached to bug 506758. I started 25 instances with only the master node in the URL and 25 with all four nodes in the URL. In each case retry-interval was 0. All clients were started before the brokers; then four cluster nodes were started and correct functioning of the cluster was observed.

I then bounced particular nodes at intervals and this (quite quickly) resulted in a hung cluster. The cluster remained unresponsive even after all clients were stopped and all but the master node were killed. In other words I believe the hang is unrecoverable except through a total cluster restart.

The pstack output for the master node appears normal, with all threads in Poller::wait() or Timer::run(); the log had no unusual messages.

Comment 2 Gordon Sim 2009-07-10 10:57:47 UTC

Reproduced with --log-enable debug+:cluster. Once hung, a failing connection attempt produces the following:

2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) new connection: 10.16.44.221:23777-60(local)
2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) add local connection 10.16.44.221:23777-60
2009-jul-10 06:51:23 debug 10.16.44.221:23777(READY/error) local close of replicated connection 10.16.44.221:23777-60(local)
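
The pattern above (a connection added and then closed locally about two seconds later while the member reports READY/error) can be picked out of a longer debug log mechanically. The following is a minimal Python sketch, not part of the broker or its tooling. It assumes the log lines have exactly the shape shown above; the function name short_lived_connections and the max_seconds threshold are hypothetical, chosen only for illustration.

```python
import re
from datetime import datetime

# Month abbreviations as they appear in the log timestamps (e.g. "2009-jul-10").
MONTHS = {name: i + 1 for i, name in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split())}

# timestamp, level, member id, (state), message -- shape assumed from the excerpt.
LINE_RE = re.compile(
    r"^(\d{4})-([a-z]{3})-(\d{2}) (\d{2}):(\d{2}):(\d{2}) "
    r"(\w+) "
    r"([^\s(]+)\(([^)]*)\) "
    r"(.*)$"
)
OPEN_RE = re.compile(r"^new connection: (\S+?)(?:\(\w+\))?$")
CLOSE_RE = re.compile(
    r"^local close of replicated connection (\S+?)(?:\(\w+\))?$")


def short_lived_connections(lines, max_seconds=5.0):
    """Return (connection id, lifetime in seconds, member state at close)
    for every connection that was opened and then locally closed within
    max_seconds. Lines that do not match the assumed format are skipped."""
    opened = {}
    hits = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue
        year, mon, day, hh, mm, ss, _level, _member, state, msg = m.groups()
        if mon not in MONTHS:
            continue
        when = datetime(int(year), MONTHS[mon], int(day),
                        int(hh), int(mm), int(ss))
        started = OPEN_RE.match(msg)
        if started:
            opened[started.group(1)] = when
            continue
        closed = CLOSE_RE.match(msg)
        if closed and closed.group(1) in opened:
            lifetime = (when - opened.pop(closed.group(1))).total_seconds()
            if lifetime <= max_seconds:
                hits.append((closed.group(1), lifetime, state))
    return hits
```

Feeding it the lines of the broker log, for example short_lived_connections(open(path)) where path is the log file, lists each connection that was dropped shortly after being accepted, together with the member state logged at close time.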

Comment 3 Gordon Sim 2009-07-10 10:58:44 UTC

Created attachment 351238
log file from 'master' node

Comment 4 Alan Conway 2009-07-10 15:51:08 UTC

Created attachment 351276
Patch to fix the issue.

Also committed to trunk as r792991.

Comment 5 Alan Conway 2009-07-10 16:27:37 UTC

Created attachment 351279
Patch based on 752581-25

Comment 6 Frantisek Reznicek 2009-07-13 09:06:57 UTC

The issue has been fixed on RHEL 5.3 i386 / x86_64 with the following packages:
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-26.el5
qpidc-debuginfo-0.5.752581-26.el5
qpidc-devel-0.5.752581-26.el5
qpidc-perftest-0.5.752581-26.el5
qpidc-rdma-0.5.752581-26.el5
qpidc-ssl-0.5.752581-26.el5
qpidd-0.5.752581-26.el5
qpidd-acl-0.5.752581-26.el5
qpidd-cluster-0.5.752581-26.el5
qpidd-devel-0.5.752581-26.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-26.el5
qpidd-ssl-0.5.752581-26.el5
qpidd-xml-0.5.752581-26.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-9.el5
rhm-docs-0.5.756148-1.el5

->VERIFIED

Comment 8 errata-xmlrpc 2009-07-14 17:32:10 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html