Created attachment 351081 [details]
pstack and backtrace

Description of problem:
During the bug 506758 validation I triggered new behaviour of the cluster on the -25 packages: the cluster may become unresponsive to any client requests. See the detailed call trace below.

Version-Release number of selected component (if applicable):
[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-25.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5

How reproducible:
90% (easily)

Steps to Reproduce:
1. Run the reproducer from https://bugzilla.redhat.com/attachment.cgi?id=351073 as './run.sh 5 75 kill' (press Enter when you get "x", or remove the 'read x' line).
2. Wait for the hang; it looks like this:
broker[s] running (pids:16122 16131 16417 16800 16964 , ports:5672 10001 10002 10003 10004 )
Waiting for clients...
75.75.75.75.75.75.75.75.75.75.K>R>75.75.75.75.75.75.75.75.75.75.K>R>75.

Actual results:
The cluster becomes unresponsive.

Expected results:
The cluster should not become unresponsive.

Additional info:
pstack and backtrace transcript bzipped & attached.
I reproduced similar behaviour in a slightly different way, and I think I can rule out bug 494393 as the cause: the cluster was successfully started and the master was never killed after that.

I used the subscribe program attached to bug 506758. I started 25 instances with only the master node in the url and 25 with all four nodes in the url; in each case retry-interval was 0. All clients were started before the brokers, then the four cluster nodes were started and correct functioning of the cluster was observed. I then bounced particular nodes at intervals and this (quite quickly) resulted in a hung cluster.

The cluster remained unresponsive even after all clients were stopped and all but the master node were killed. In other words, I believe the hang is unrecoverable except through a total cluster restart. The pstack output for the master node appears normal (all threads in Poller::wait() or Timer::run()), and the log didn't have any unusual messages.
Reproduced with --log-enable debug+:cluster. Once hung, a failing connection attempt produces the following:

2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) new connection: 10.16.44.221:23777-60(local)
2009-jul-10 06:51:21 debug 10.16.44.221:23777(READY/error) add local connection 10.16.44.221:23777-60
2009-jul-10 06:51:23 debug 10.16.44.221:23777(READY/error) local close of replicated connection 10.16.44.221:23777-60(local)
Created attachment 351238 [details]
log file from 'master' node
Created attachment 351276 [details]
Patch to fix the issue.

Also committed to trunk as r792991.
Created attachment 351279 [details]
Patch based on 752581-25
The issue has been fixed on RHEL 5.3 i386 / x86_64 with the following packages:

openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-26.el5
qpidc-debuginfo-0.5.752581-26.el5
qpidc-devel-0.5.752581-26.el5
qpidc-perftest-0.5.752581-26.el5
qpidc-rdma-0.5.752581-26.el5
qpidc-ssl-0.5.752581-26.el5
qpidd-0.5.752581-26.el5
qpidd-acl-0.5.752581-26.el5
qpidd-cluster-0.5.752581-26.el5
qpidd-devel-0.5.752581-26.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-26.el5
qpidd-ssl-0.5.752581-26.el5
qpidd-xml-0.5.752581-26.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-9.el5
rhm-docs-0.5.756148-1.el5

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html