Bug 508675 - Unresponsive qpidd process hangs the cluster
Summary: Unresponsive qpidd process hangs the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.1
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: 1.3
: ---
Assignee: Alan Conway
QA Contact: Jan Sarenik
URL:
Whiteboard:
Duplicates: 515026 (view as bug list)
Depends On:
Blocks:
 
Reported: 2009-06-29 13:09 UTC by Alan Conway
Modified: 2018-10-19 23:56 UTC (History)
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, a clustered broker that stopped responding could cause the entire cluster to stop responding as well. To prevent this, a watchdog mechanism has been introduced that detects and eventually kills an unresponsive qpidd process.
Clone Of:
Environment:
Last Closed: 2010-10-14 15:58:48 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
watchdog source code (10.00 KB, application/x-tar)
2009-08-11 19:44 UTC, Alan Conway
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 15:56:44 UTC

Description Alan Conway 2009-06-29 13:09:08 UTC
Description of problem:

If a broker process in a cluster hangs (simulated by SIGSTOP/CTRL-Z) the cluster continues to function for a while, then the entire cluster hangs. openais on the hung node appears to be in a busy loop. If the hung node is killed the rest of the cluster resumes with no message loss.

We should add a mechanism to detect and kill an unresponsive qpidd process within some configured timeout.

Version-Release number of selected component (if applicable):


How reproducible: easy

Steps to Reproduce:
0. yum install openais qpidd qpidc-perftest 
1. Start aisexec and qpidd --cluster-name testcluster on each node.
2. Run the attached tuneup script on each node. Not sure if this is essential to
reproduce; it is written for a dual quad-core box and may need adjustment.
3. On host X, run: while true; do perftest; done
4. On host Y, ctrl-z or kill -STOP the qpidd process.

Actual results:

Perftest on host X hangs; there is no activity on any host but Y. Host Y is in a
busy loop and unresponsive.

Expected results:

Unresponsive qpidd process should be killed after some configurable timeout.

Additional info:

The openais part of this is bug 504788

Comment 1 Alan Conway 2009-06-29 13:18:00 UTC
WORKAROUNDS

The simple qpid_ping client program at:
 https://svn.apache.org/repos/asf/qpid/trunk/qpid/cpp/src/tests/qpid_ping.cpp
tries to contact a broker and send itself a message. If it does not succeed within a configured timeout (1 second by default), it returns a non-zero exit status.

The user could write a simple monitoring script to restart an unresponsive qpidd such as:

while true; do 
  qpid_ping --timeout 5 || { pkill qpidd; qpidd -d; }
  sleep 5m
done

For more drastic action you could use qpid_ping with watchdog(8) to reboot the system if qpidd becomes unresponsive.
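As a sketch of that more drastic option: watchdog(8) reads /etc/watchdog.conf and can run a periodic test command, rebooting the machine if it keeps failing. The wrapper path and settings below are hypothetical, not something shipped with qpid:

```
# /etc/watchdog.conf -- reboot if the broker stays unresponsive
# test-binary must exit non-zero on failure, e.g. a small wrapper
# script that runs: qpid_ping --timeout 5
test-binary = /usr/local/bin/check-qpidd
interval    = 10
```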

Comment 2 Alan Conway 2009-07-02 20:46:21 UTC
from sdake: 

Ok, so after thinking about this more, I have come to the conclusion that
what you really want is an availability manager.

There are currently 3 choices.

rgmanager - Simple; requires the integration of rgmanager as a
dependency and the development of some scripts to monitor the status of
the daemon. Works multi-node. Lon can provide more details here. Long
term I think rgmanager will be replaced by Pacemaker. Available in
rhel5/rhel6.

AMF - Provides a well specified programmatic interface for doing
availability management. Works multi-node. The current implementation is
not suitable for deployment and won't be for a long time.

Pacemaker - Very complex to set up, but also allows very complex
configurations. Works multi-node. May be / is overkill for your
situation. Likely the standard choice for future RHEL. Won't be
available in RHEL5.

?? - The component we are missing from openais or corosync has the
following (very narrow) use case: application healthchecking and service
application restart for a replicating state machine process. The
features are:
1) simple - drop in 1 config file per process and the service is
started/stopped automatically
2) programmatic C interface for executing healthchecking and requesting
restart
3) integrated into openais or corosync in a way that allows simple
single-process restart
4) no clustering concepts integrated into this service
5) wide distribution without dependency issues
The basic idea of such a service is to provide dead simple process
restart for cpg replication applications.

I am not sure if we need a 4th availability manager but the concept of
simple, non-cluster aware, replicating process failure detector and
restart mechanism to accompany CPG application design is appealing for
very simple use cases (and targeting open source and startup
opportunities).

Comment 3 Alan Conway 2009-08-06 13:10:00 UTC
See bug 515026 for another way that an unresponsive broker can hang the cluster.

Comment 4 Alan Conway 2009-08-06 13:10:09 UTC
*** Bug 515026 has been marked as a duplicate of this bug. ***

Comment 5 Alan Conway 2009-08-10 21:13:30 UTC
A simple watchdog feature for qpidd in a cluster has been committed at revision 508675.

    Watchdog feature to remove unresponsive cluster nodes.

    In some instances (e.g. while resolving an error) it's possible for a
    hung process to hang the entire cluster as the other members wait for
    its response. The cluster can handle terminated processes, but hung
    processes present a problem.

    If the watchdog plugin is loaded and --watchdog-interval is set then
    the broker forks a child process that runs a very simple watchdog
    program, and starts a timer in the broker process to signal the watchdog
    every interval/2 seconds. The watchdog kills its parent if it does not
    receive a signal for interval seconds. This allows a stuck broker to be
    removed from the cluster so other cluster members can continue.

Comment 6 Alan Conway 2009-08-11 19:44:34 UTC
Created attachment 357070 [details]
watchdog source code 

Builds and runs with qpidd-devel-0.5.752581-26.el5

Comment 8 Alan Conway 2009-09-11 13:28:01 UTC
The watchdog feature was added to the product in SVN r802927

Comment 10 Jan Sarenik 2010-07-22 14:36:33 UTC
Reproduced on qpidd-cluster-0.5.752581-34.el5 (current stable)

Comment 11 Jan Sarenik 2010-07-23 08:41:28 UTC
The current watchdog actually does not help when the broker was
STOPped by CTRL-Z (or kill -STOP). The stopped node is removed
from the cluster only when the process is resumed and receives
the signal from the watchdog to shut down.

Is this expected? If yes, I do not know how to test it.

qpid-cpp-server-cluster-0.7.946106-8.el5

Comment 12 Jan Sarenik 2010-07-23 10:28:44 UTC
The problem is that while the broker (node) does not leave the
cluster, the same hang appears as on the version without the
watchdog. Can this be fixed so that the watchdog also sends
SIGSTOP if the process is in a stopped state?

Comment 13 Alan Conway 2010-07-23 15:38:07 UTC
This is working for me: 

[aconway@mrg32 ~]$ qpidd --watchdog-interval=5 --daemon
[aconway@mrg32 ~]$ pgrep -lf qpidd
21061 qpidd --watchdog-interval=5 --daemon
21063 /home/remote/aconway/install/libexec/qpid/qpidd_watchdog 5
[aconway@mrg32 ~]$ kill -stop 21061; while pgrep qpidd; do sleep 1; done; 
21061
21063
21061
21063
21061
21063
[aconway@mrg32 ~]$ pgrep qpidd
[aconway@mrg32 ~]$ 

It also works with a broker in a cluster.

How are you testing this?

Comment 14 Jan Sarenik 2010-07-26 08:50:05 UTC
Aah, I hadn't tried to run it as a daemon. I ran it in the foreground
and suspended it with CTRL-Z in Bash. Will try this today. Thanks
for the info.

Comment 15 Jan Sarenik 2010-07-26 12:36:52 UTC
It works as expected. Tested on RHEL5 i386 and x86_64,
qpid-cpp-server-cluster-0.7.946106-8.el5

# qpidd --daemon --auth=no --cluster-name=ahoj --watchdog-interval 5
# qpidd --daemon --auth=no --cluster-name=ahoj --data-dir /tmp/qpidd2 -p 12345 --watchdog-interval 5
# qpid-cluster 
  Cluster Name: ahoj
Cluster Status: ACTIVE
  Cluster Size: 2
       Members: ID=10.34.26.26:3816 URL=amqp:tcp:10.34.26.26:12345
              : ID=10.34.26.26:3824 URL=amqp:tcp:10.34.26.26:5672
# kill -STOP <PID>
# sleep 5
# qpid-cluster
  Cluster Name: ahoj
Cluster Status: ACTIVE
  Cluster Size: 1
       Members: ID=10.34.26.26:3824 URL=amqp:tcp:10.34.26.26:5672

Comment 16 Jaromir Hradilek 2010-10-07 16:59:32 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, a clustered broker that stopped responding could cause the entire cluster to stop responding as well. To prevent this, a watchdog mechanism has been introduced that detects and eventually kills an unresponsive qpidd process.

Comment 18 errata-xmlrpc 2010-10-14 15:58:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

