Description of problem:

If a broker process in a cluster hangs (simulated by SIGSTOP/CTRL-Z) the cluster continues to function for a while, then the entire cluster hangs. openais on the hung node appears to be in a busy loop. If the hung node is killed, the rest of the cluster resumes with no message loss. We should add a mechanism to detect and kill an unresponsive qpidd process after some configured timeout.

Version-Release number of selected component (if applicable):

How reproducible:
Easy.

Steps to Reproduce:
0. yum install openais qpidd qpidc-perftest
1. Start aisexec and qpidd --cluster-name testcluster on each node.
2. Run the attached tuneup script on each node. Not sure if this is essential to reproduce; it's written for a dual quad-core box and may need adjustment.
3. On host X run: while true; do perftest; done
4. On host Y, CTRL-Z or kill -STOP the qpidd process.

Actual results:
perftest on host X is hung; no activity on any host but Y. Host Y is in a hard loop and unresponsive.

Expected results:
The unresponsive qpidd process should be killed after some configurable timeout.

Additional info:
The openais part of this is bug 504788.
WORKAROUNDS

The simple qpid_ping client program at:
https://svn.apache.org/repos/asf/qpid/trunk/qpid/cpp/src/tests/qpid_ping.cpp
tries to contact a broker and send itself a message. If it does not succeed within a configured timeout (1 second by default) it returns a non-zero exit status. The user could write a simple monitoring script to restart an unresponsive qpidd, such as:

    while true; do
        qpid_ping --timeout 5 || { pkill qpidd; qpidd -d; }
        sleep 5m
    done

For more drastic action, qpid_ping could be used with watchdog(8) to reboot the system if qpidd becomes unresponsive.
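The control flow of the monitoring loop above can be exercised without a broker. The sketch below is illustrative only: health_check and restart_broker are hypothetical stand-ins for "qpid_ping --timeout 5" and "pkill qpidd; qpidd -d", and the BROKER_OK flag simulates broker state.

```shell
#!/bin/sh
# Self-contained sketch of the monitoring loop, with the qpid_ping
# call and the restart action replaced by hypothetical stand-ins so
# the control flow can be run without a broker.
BROKER_OK=yes
health_check()   { [ "$BROKER_OK" = yes ]; }    # stands in for: qpid_ping --timeout 5
restart_broker() { echo "restarting qpidd"; BROKER_OK=yes; }  # stands in for: pkill qpidd; qpidd -d

monitor_once() { health_check || restart_broker; }

monitor_once          # broker healthy: nothing happens
BROKER_OK=no          # simulate a hung broker
monitor_once          # prints: restarting qpidd
```

In the real workaround the loop body simply runs every 5 minutes; the sketch calls it twice to show both the healthy and the hung path.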
From sdake:

Ok, so after thinking about this more, I have come to the conclusion that what you really want is an availability manager. There are currently 3 choices:

rgmanager - Simple; requires integration of rgmanager as a dependency and development of some scripts to monitor the status of the daemon. Works multi-node. Lon can provide more details here. Long term I think rgmanager will be replaced by Pacemaker. Available in RHEL5/RHEL6.

AMF - Provides a well-specified programmatic interface for availability management. Works multi-node. The current implementation is not suitable for deployment and won't be for a long time.

Pacemaker - Very complex to set up, but also allows very complex configurations. Works multi-node. May be (or is) overkill for your situation. Likely the standard choice for future RHEL. Won't be available in RHEL5.

?? - The component we are missing from openais or corosync has the following (very narrow) use case: application health-checking and application restart for a replicating state-machine process. The features are:
1) simple - drop in 1 config file per process and the service is started/stopped automatically
2) programmatic C interface for executing health checks and requesting restart
3) integrated into openais or corosync in a way that allows simple single-process restart
4) no clustering concepts integrated into this service; the basic idea is to provide dead-simple process restart for CPG replication applications
5) wide distribution without dependency issues

I am not sure if we need a 4th availability manager, but the concept of a simple, non-cluster-aware failure detector and restart mechanism for replicating processes, to accompany CPG application design, is appealing for very simple use cases (and for targeting open source and startup opportunities).
See bug 515026 for another way that an unresponsive broker can hang the cluster.
*** Bug 515026 has been marked as a duplicate of this bug. ***
A simple watchdog feature for qpidd in a cluster has been committed at revision 508675.

Watchdog feature to remove unresponsive cluster nodes: in some instances (e.g. while resolving an error) it is possible for a hung process to hang the entire cluster as the other members wait for its response. The cluster can handle terminated processes, but hung processes present a problem. If the watchdog plugin is loaded and --watchdog-interval is set, the broker forks a child process that runs a very simple watchdog program, and starts a timer in the broker process to signal the watchdog every interval/2 seconds. The watchdog kills its parent if it does not receive a signal for interval seconds. This allows a stuck broker to be removed from the cluster so the other cluster members can continue.
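The mechanism described above can be sketched in shell. This is an illustrative stand-in, not the actual qpidd_watchdog source (which is a small C program); using SIGUSR1 as the heartbeat and SIGKILL as the kill signal are assumptions of the sketch, as is passing the broker pid explicitly.

```shell
#!/bin/sh
# Illustrative sketch of the watchdog child's logic: kill the broker
# if no heartbeat signal arrives within a given interval.
# (Hypothetical stand-in for qpidd_watchdog; signal choices assumed.)
watchdog_loop() {
    interval=$1
    broker=$2
    trap 'got_heartbeat=1' USR1   # broker would signal us every interval/2 seconds
    while :; do
        got_heartbeat=0
        sleep "$interval" &
        wait $!                   # interruptible, so the trap can fire mid-sleep
        if [ "$got_heartbeat" -eq 0 ]; then
            kill -9 "$broker"     # no heartbeat: remove the hung broker
            return 0
        fi
    done
}
```

One relevant property of SIGKILL, if that is indeed the signal used: it cannot be caught, blocked, or held pending by a stopped process, so it removes even a SIGSTOPped broker immediately, which is consistent with the daemon test transcripts later in this report.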
Created attachment 357070 [details]
watchdog source code

Builds and runs with qpidd-devel-0.5.752581-26.el5
The watchdog feature was added to the product in SVN r802927
Reproduced on qpidd-cluster-0.5.752581-34.el5 (current stable)
The current watchdog actually does not help when the broker is STOPped by CTRL-Z (or kill -STOP). The stopped node is removed from the cluster only when the process is resumed and receives the signal from the watchdog to shut down. Is this expected? If yes, I do not know how to test it. qpid-cpp-server-cluster-0.7.946106-8.el5
The problem is that while the broker (node) does not leave the cluster, the same hang appears as on the version without the watchdog. Can this be fixed so that the watchdog also removes the process when it is in the stopped state?
This is working for me:

    [aconway@mrg32 ~]$ qpidd --watchdog-interval=5 --daemon
    [aconway@mrg32 ~]$ pgrep -lf qpidd
    21061 qpidd --watchdog-interval=5 --daemon
    21063 /home/remote/aconway/install/libexec/qpid/qpidd_watchdog 5
    [aconway@mrg32 ~]$ kill -stop 21061; while pgrep qpidd; do sleep 1; done;
    21061
    21063
    21061
    21063
    21061
    21063
    [aconway@mrg32 ~]$ pgrep qpidd
    [aconway@mrg32 ~]$

It also works with a broker in a cluster. How are you testing this?
Aah, I haven't tried running it as a daemon. I ran it in the foreground and suspended it with CTRL-Z in Bash. Will try this today. Thanks for the info.
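A likely explanation for the difference (an inference, not stated explicitly in this thread): CTRL-Z sends SIGTSTP to the entire foreground process group, so a foreground broker's watchdog child is suspended along with it, whereas kill -STOP <pid> stops only that one process and leaves the watchdog child running. A small shell illustration of stopping one process out of two:

```shell
#!/bin/sh
# Stop one of two processes with kill -STOP and observe that the
# other keeps running -- unlike CTRL-Z, which suspends the whole
# foreground process group.  The sleeps are stand-ins.
sleep 60 & broker=$!      # stands in for the broker
sleep 60 & watchdog=$!    # stands in for the watchdog child

kill -STOP "$broker"
broker_state=$(ps -o stat= -p "$broker" | tr -d ' ')
watchdog_state=$(ps -o stat= -p "$watchdog" | tr -d ' ')
echo "broker=$broker_state watchdog=$watchdog_state"   # e.g. broker=T watchdog=S

kill -CONT "$broker"
kill "$broker" "$watchdog"
```

This matches the observed behavior: with --daemon the watchdog child survives the kill -STOP and removes the stopped broker; in a suspended foreground job neither process runs, so nothing can fire.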
It works as expected. Tested on RHEL5 i386 and x86_64, qpid-cpp-server-cluster-0.7.946106-8.el5.

    # qpidd --daemon --auth=no --cluster-name=ahoj --watchdog-interval 5
    # qpidd --daemon --auth=no --cluster-name=ahoj --data-dir /tmp/qpidd2 -p 12345 --watchdog-interval 5
    # qpid-cluster
      Cluster Name: ahoj
      Cluster Status: ACTIVE
      Cluster Size: 2
      Members: ID=10.34.26.26:3816 URL=amqp:tcp:10.34.26.26:12345
             : ID=10.34.26.26:3824 URL=amqp:tcp:10.34.26.26:5672
    # kill -STOP <PID>
    # sleep 5
    # qpid-cluster
      Cluster Name: ahoj
      Cluster Status: ACTIVE
      Cluster Size: 1
      Members: ID=10.34.26.26:3824 URL=amqp:tcp:10.34.26.26:5672
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, when a clustered broker stopped responding, it could cause the entire cluster to stop responding as well. To prevent this, a watchdog mechanism has been introduced to detect and, after a configurable timeout, kill an unresponsive qpidd process.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html