Bug 510504
| Field | Value |
|---|---|
| Summary | clustered qpidd hang - cluster unresponsive (error Connection timed out: closing) |
| Product | Red Hat Enterprise MRG |
| Component | qpid-cpp |
| Version | 1.1.2 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Target Milestone | 1.1.6 |
| Hardware | All |
| OS | Linux |
| Reporter | Frantisek Reznicek <freznice> |
| Assignee | Alan Conway <aconway> |
| QA Contact | MRG Quality Engineering <mrgqe-bugs> |
| CC | cctrieloff, esammons, gsim |
| Doc Type | Bug Fix |
| Last Closed | 2009-07-14 17:32:10 UTC |

Description
Frantisek Reznicek
2009-07-09 14:38:14 UTC
I reproduced similar behaviour in a slightly different way, and I think I can rule out bug 494393 as the cause, as the cluster was successfully started and the master was never killed after that.

I used the subscribe program attached to bug 506758. I started 25 instances with only the master node in the URL and 25 with all four nodes in the URL. In each case retry-interval was 0. All clients were started before the brokers; then four cluster nodes were started and correct functioning of the cluster was observed. I then bounced particular nodes at intervals, and this (quite quickly) resulted in a hung cluster.

The cluster remained unresponsive even after all clients were stopped and all but the master node were killed. In other words, I believe the hang is unrecoverable except through a total cluster restart.

The pstack output for the master node appears normal, with all threads in Poller::wait() or Timer::run(); the log didn't have any unusual messages.

Reproduced with --log-enable debug+:cluster. Once hung, a failing connection attempt produces the following:

    2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) new connection: 10.16.44.221:23777-60(local)
    2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) add local connection 10.16.44.221:23777-60
    2009-jul-10 06:51:23 debug 10.16.44.221:23777(READY/error) local close of replicated connection 10.16.44.221:23777-60(local)

Created attachment 351238 [details]
log file from 'master' node
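
For anyone going through a log like the one attached, the following is a minimal sketch of how the quoted debug lines could be parsed to see how long each connection attempt lived before the broker closed it. The line layout (timestamp, level, member id, state in parentheses, message) is inferred only from the three lines quoted in the description, so other qpidd log lines may not match; the names `parse_line` and `connection_lifetimes` are made up for this illustration and are not part of qpid.

```python
import re
from datetime import datetime

# Month abbreviations as they appear in the quoted log lines (lowercase).
MONTHS = {name: i + 1 for i, name in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split())}

# ASSUMPTION: layout inferred from the three lines quoted in this bug only.
LINE_RE = re.compile(
    r"(\d{4})-([a-z]{3})-(\d{2}) (\d{2}):(\d{2}):(\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<member>[\d.]+:\d+)\((?P<state>[^)]*)\) "
    r"(?P<msg>.*)"
)
# Connection ids in the quoted lines look like "<ip>:<port>-<number>".
CONN_RE = re.compile(r"(?P<conn>[\d.]+:\d+-\d+)")


def parse_line(line):
    """Return a dict for a matching log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None or m.group(2) not in MONTHS:
        return None
    year, mon, day, hh, mm, ss = m.group(1, 2, 3, 4, 5, 6)
    ts = datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))
    return {"ts": ts, "level": m.group("level"), "member": m.group("member"),
            "state": m.group("state"), "msg": m.group("msg")}


def connection_lifetimes(lines):
    """Map connection id -> seconds between 'new connection' and a close line."""
    opened, lifetimes = {}, {}
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = CONN_RE.search(rec["msg"])
        if m is None:
            continue
        conn = m.group("conn")
        if rec["msg"].startswith("new connection"):
            opened[conn] = rec["ts"]
        elif "close" in rec["msg"] and conn in opened:
            lifetimes[conn] = (rec["ts"] - opened.pop(conn)).total_seconds()
    return lifetimes


# The three lines quoted in the description, verbatim.
SAMPLE = [
    "2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) new connection: 10.16.44.221:23777-60(local)",
    "2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) add local connection 10.16.44.221:23777-60",
    "2009-jul-10 06:51:23 debug 10.16.44.221:23777(READY/error) local close of replicated connection 10.16.44.221:23777-60(local)",
]

if __name__ == "__main__":
    print(connection_lifetimes(SAMPLE))
```

On the quoted lines this reports a lifetime of 2 seconds for connection 10.16.44.221:23777-60. It only summarizes what the log already shows and does not by itself diagnose the hang.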
Created attachment 351276 [details]
Patch to fix the issue.
Also committed to trunk as r792991.
Created attachment 351279 [details]
Patch based on 752581-25
The issue has been fixed on RHEL 5.3 i386 / x86_64 on packages:

    openais-0.80.3-22.el5_3.8
    openais-debuginfo-0.80.3-22.el5_3.8
    openais-devel-0.80.3-22.el5_3.8
    python-qpid-0.5.752581-3.el5
    qpidc-0.5.752581-26.el5
    qpidc-debuginfo-0.5.752581-26.el5
    qpidc-devel-0.5.752581-26.el5
    qpidc-perftest-0.5.752581-26.el5
    qpidc-rdma-0.5.752581-26.el5
    qpidc-ssl-0.5.752581-26.el5
    qpidd-0.5.752581-26.el5
    qpidd-acl-0.5.752581-26.el5
    qpidd-cluster-0.5.752581-26.el5
    qpidd-devel-0.5.752581-26.el5
    qpid-dotnet-0.4.738274-2.el5
    qpidd-rdma-0.5.752581-26.el5
    qpidd-ssl-0.5.752581-26.el5
    qpidd-xml-0.5.752581-26.el5
    qpid-java-client-0.5.751061-8.el5
    qpid-java-common-0.5.751061-8.el5
    rhm-0.5.3206-9.el5
    rhm-docs-0.5.756148-1.el5

->VERIFIED

An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html