Bug 508675 - Unresponsive qpidd process hangs the cluster
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Hardware: All Linux
Priority: urgent, Severity: high
: 1.3
: ---
Assigned To: Alan Conway
Jan Sarenik
Duplicates: 515026
Depends On:
Reported: 2009-06-29 09:09 EDT by Alan Conway
Modified: 2010-10-14 11:58 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when a clustered broker stopped responding, it may have caused the entire cluster to stop responding as well. To prevent this, a watchdog mechanism to detect and eventually kill an unresponsive qpidd process has been introduced.
Last Closed: 2010-10-14 11:58:48 EDT

Attachments
watchdog source code (10.00 KB, application/x-tar)
2009-08-11 15:44 EDT, Alan Conway

Description Alan Conway 2009-06-29 09:09:08 EDT
Description of problem:

If a broker process in a cluster hangs (simulated by SIGSTOP/CTRL-Z) the cluster continues to function for a while, then the entire cluster hangs. openais on the hung node appears to be in a busy loop. If the hung node is killed the rest of the cluster resumes with no message loss.

We should add a mechanism to detect and kill an unresponsive qpidd process after a configurable timeout.

Version-Release number of selected component (if applicable):

How reproducible: easy

Steps to Reproduce:
0. yum install openais qpidd qpidc-perftest
1. Start aisexec and qpidd --cluster-name testcluster on each node.
2. Run attached tuneup script on each node. Not sure if this is essential to
reproduce. It's written for a dual quad-core box, may need adjustment.
3. On host X run while true; do perftest; done
4. On host Y, ctrl-z or kill -STOP the qpidd process.

Actual results:

Perftest on host X is hung, no activity on any host but Y. Host Y is in a hard
loop and unresponsive.

Expected results:

Unresponsive qpidd process should be killed after some configurable timeout.

Additional info:

The openais part of this is bug 504788
Comment 1 Alan Conway 2009-06-29 09:18:00 EDT

The simple qpid_ping client program at:
tries to contact a broker and send itself a message. If it does not succeed within a configured timeout (1 second by default) it returns non-0 exit status.

The user could write a simple monitoring script to restart an unresponsive qpidd such as:

# Check the broker every 5 minutes; restart it if the ping fails.
while true; do
  qpid_ping --timeout 5 || { pkill qpidd; qpidd -d; }
  sleep 5m
done

For more drastic action you could use qpid_ping with watchdog(8) to reboot the system if qpidd becomes unresponsive.
Comment 2 Alan Conway 2009-07-02 16:46:21 EDT
from sdake: 

OK, so after thinking about this more, I have come to the conclusion that
what you really want is an availability manager.

There are currently 3 choices.

rgmanager - simple, requires the integration of rgmanager as a
dependency, requires development of some scripts to monitor status of
the daemon.  Works multi-node.  Lon can provide more details here.  Long
term I think rgmanager will be replaced by Pacemaker.  Available in
AMF - Provides a well specified programmatic interface for doing
availability management.  Works multi-node.  Current implementation not
suitable for deployment and won't be for a long time.
Pacemaker - very complex to set up but also allows very complex
configurations.  Works multi-node.  May be / is overkill for your
situation.  Likely standard choice for future RHEL.  Won't be available
in RHEL5.

?? - The component we are missing from openais or corosync has the
following (very narrow) use case: application healthchecking and service
application restart for a replicating state machine process.  The
features are 1) simple - drop in 1 config file per process and service
is started/stopped automatically 2) programmatic C interface for
executing healthchecking and requesting restart 3) integrated into
openais or corosync in a way that allows simple single process restart
4) no clustering concepts integrated into this service.  The basic idea
of such a service is to provide dead simple process restart for cpg
replication applications 5) wide distribution without dependency issues.

I am not sure if we need a 4th availability manager but the concept of
simple, non-cluster aware, replicating process failure detector and
restart mechanism to accompany CPG application design is appealing for
very simple use cases (and targeting open source and startup
Comment 3 Alan Conway 2009-08-06 09:10:00 EDT
See bug 515026 for another way that an unresponsive broker can hang the cluster.
Comment 4 Alan Conway 2009-08-06 09:10:09 EDT
*** Bug 515026 has been marked as a duplicate of this bug. ***
Comment 5 Alan Conway 2009-08-10 17:13:30 EDT
A simple watchdog feature for qpidd in a cluster has been committed at revision 508675.

    Watchdog feature to remove unresponsive cluster nodes.

    In some instances (e.g. while resolving an error) it's possible for a
    hung process to hang the entire cluster as the other members wait for
    its response.
    The cluster can handle terminated processes but hung processes present
    a problem.

    If the watchdog plugin is loaded and --watchdog-interval is set then
    the broker forks a child process that runs a very simple watchdog
    program, and starts a timer in the broker process to signal the watchdog
    every interval/2 seconds. The watchdog kills its parent if it does not
    receive a signal for interval seconds. This allows a stuck broker to be
    removed from the cluster so other cluster members can continue.
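
The heartbeat scheme described above can be sketched in a few lines of shell. This is only an illustration of the idea, not the attached qpidd_watchdog source: the function name watchdog, the use of SIGUSR1 as the heartbeat and SIGKILL as the kill signal, and the one-second polling step are all assumptions made for the sketch.

```shell
# Sketch of a heartbeat watchdog (illustrative, not the qpidd_watchdog code).
# The watched process must send SIGUSR1 to the watchdog at least once per
# interval; if it stays silent for INTERVAL seconds the watchdog kills it.
# SIGUSR1 and SIGKILL are assumed signal choices.

watchdog() {   # usage: watchdog TARGET_PID INTERVAL_SECONDS
  wd_target=$1
  wd_interval=$2
  wd_last=$(date +%s)                     # time of the last heartbeat
  trap 'wd_last=$(date +%s)' USR1         # each heartbeat resets the clock
  while kill -0 "$wd_target" 2>/dev/null; do
    if [ $(( $(date +%s) - wd_last )) -ge "$wd_interval" ]; then
      kill -KILL "$wd_target"             # no heartbeat for a full interval
      return 0
    fi
    sleep 1 &                             # sleep in the background so that
    wait $!                               # a trapped signal interrupts wait
  done
  return 0                                # target already exited
}

# Usage sketch (the "broker" side):
#   watchdog $$ 10 &                      # child watches this shell's PID
#   wd_pid=$!
#   while true; do kill -USR1 "$wd_pid"; sleep 5; done   # every interval/2
```

SIGKILL is used here because it cannot be caught or blocked and takes effect even when the target is stopped, which matters when the hang is simulated with kill -STOP.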
Comment 6 Alan Conway 2009-08-11 15:44:34 EDT
Created attachment 357070
watchdog source code 

Builds and runs with qpidd-devel-0.5.752581-26.el5
Comment 8 Alan Conway 2009-09-11 09:28:01 EDT
The watchdog feature was added to the product in SVN r802927
Comment 10 Jan Sarenik 2010-07-22 10:36:33 EDT
Reproduced on qpidd-cluster-0.5.752581-34.el5 (current stable)
Comment 11 Jan Sarenik 2010-07-23 04:41:28 EDT
The current watchdog does not actually help when the broker is
stopped with CTRL-Z (or kill -STOP). The stopped node is removed
from the cluster only when the process is resumed and receives
the signal from the watchdog to shut down.

Is this expected? If yes, I do not know how to test it then.

Comment 12 Jan Sarenik 2010-07-23 06:28:44 EDT
The problem is that while the broker (node) does not leave the
cluster, the same hang appears as on the version without the
watchdog. Can this be fixed so that the watchdog also sends
SIGSTOP if the process is in the stopped state?
Comment 13 Alan Conway 2010-07-23 11:38:07 EDT
This is working for me: 

[aconway@mrg32 ~]$ qpidd --watchdog-interval=5 --daemon
[aconway@mrg32 ~]$ pgrep -lf qpidd
21061 qpidd --watchdog-interval=5 --daemon
21063 /home/remote/aconway/install/libexec/qpid/qpidd_watchdog 5
[aconway@mrg32 ~]$ kill -stop 21061; while pgrep qpidd; do sleep 1; done; 
[aconway@mrg32 ~]$ pgrep qpidd
[aconway@mrg32 ~]$ 

It also works with a broker in a cluster.

How are you testing this?
Comment 14 Jan Sarenik 2010-07-26 04:50:05 EDT
Aah, I hadn't tried running it as a daemon. I ran it in the foreground
and suspended it with CTRL-Z in Bash. Will try this today. Thanks
for the info.
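
One possible explanation for the difference, offered here as an assumption rather than something confirmed in this report: CTRL-Z in an interactive shell sends SIGTSTP to the whole foreground process group, so a forked watchdog child that is still in that group may be suspended together with the broker, whereas kill -STOP <PID> stops only the one process. A small helper like the following (the name is_stopped is hypothetical, and it relies on the Linux /proc filesystem) can check which processes are actually stopped:

```shell
# is_stopped PID: succeed if the process is in the stopped state ("T"
# in the State field of /proc/PID/status). Linux-specific sketch;
# the helper name is hypothetical.
is_stopped() {
  state=$(awk '/^State:/ { print $2; exit }' "/proc/$1/status" 2>/dev/null)
  [ "$state" = "T" ]
}

# Usage sketch: report the state of every qpidd-related process.
#   for pid in $(pgrep -f qpidd); do
#     if is_stopped "$pid"; then echo "$pid stopped"; else echo "$pid running"; fi
#   done
```

If both the broker and the qpidd_watchdog process show up as stopped, the watchdog cannot fire until the group is resumed.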
Comment 15 Jan Sarenik 2010-07-26 08:36:52 EDT
It works as expected. Tested on RHEL5 i386 and x86_64:

# qpidd --daemon --auth=no --cluster-name=ahoj --watchdog-interval 5
# qpidd --daemon --auth=no --cluster-name=ahoj --data-dir /tmp/qpidd2 -p 12345 --watchdog-interval 5
# qpid-cluster 
  Cluster Name: ahoj
Cluster Status: ACTIVE
  Cluster Size: 2
       Members: ID= URL=amqp:tcp:
              : ID= URL=amqp:tcp:
# kill -STOP <PID>
# sleep 5
# qpid-cluster
  Cluster Name: ahoj
Cluster Status: ACTIVE
  Cluster Size: 1
       Members: ID= URL=amqp:tcp:
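
The manual check above could be scripted roughly as follows. This is a sketch that assumes qpid-cluster prints a "Cluster Size: N" line as in the transcript; the helper names cluster_size and wait_for_cluster_size are hypothetical.

```shell
# cluster_size: read qpid-cluster style output on stdin and print the
# number after "Cluster Size:". Assumes the output format shown above.
cluster_size() {
  awk -F': *' '/Cluster Size/ { print $2; exit }'
}

# wait_for_cluster_size EXPECTED TIMEOUT_SECONDS: poll qpid-cluster once
# per second until it reports EXPECTED members; fail after the timeout.
wait_for_cluster_size() {
  wf_deadline=$(( $(date +%s) + $2 ))
  while [ "$(date +%s)" -lt "$wf_deadline" ]; do
    [ "$(qpid-cluster 2>/dev/null | cluster_size)" = "$1" ] && return 0
    sleep 1
  done
  return 1
}

# Usage sketch, after kill -STOP <PID> with --watchdog-interval 5:
#   wait_for_cluster_size 1 10 && echo "stopped broker was removed"
```

Polling with a timeout a little longer than the watchdog interval avoids depending on a fixed sleep 5 landing after the kill.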
Comment 16 Jaromir Hradilek 2010-10-07 12:59:32 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    New Contents:
Previously, when a clustered broker stopped responding, it may have caused the entire cluster to stop responding as well. To prevent this, a watchdog mechanism to detect and eventually kill an unresponsive qpidd process has been introduced.
Comment 18 errata-xmlrpc 2010-10-14 11:58:48 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

