Description of problem:

If a broker process in a cluster hangs (simulated by SIGSTOP/CTRL-Z) the cluster continues to function for a while, then the entire cluster hangs. openais on the hung node appears to be in a busy loop. If the hung node is killed, the rest of the cluster resumes with no message loss. We should add a mechanism to detect and kill an unresponsive qpidd process after some configured timeout.

Version-Release number of selected component (if applicable):

How reproducible:
Easy.

Steps to Reproduce:
0. yum install openais qpidd qpidc-perftest
1. Start aisexec and qpidd --cluster-name testcluster on each node.
2. Run the attached tuneup script on each node. Not sure if this is essential to reproduce; it's written for a dual quad-core box and may need adjustment.
3. On host X run: while true; do perftest; done
4. On host Y, CTRL-Z or kill -STOP the qpidd process.

Actual results:
perftest on host X is hung; no activity on any host but Y. Host Y is in a hard loop and unresponsive.

Expected results:
The unresponsive qpidd process should be killed after some configurable timeout.

Additional info:
The openais part of this is bug 504788.
WORKAROUNDS

The simple qpid_ping client program at:
https://svn.apache.org/repos/asf/qpid/trunk/qpid/cpp/src/tests/qpid_ping.cpp
tries to contact a broker and send itself a message. If it does not succeed within a configured timeout (1 second by default) it returns a non-zero exit status. The user could write a simple monitoring script to restart an unresponsive qpidd, such as:

    while true; do
        qpid_ping --timeout 5 || { pkill qpidd; qpidd -d; }
        sleep 5m
    done

For more drastic action, qpid_ping could be used with watchdog(8) to reboot the system if qpidd becomes unresponsive.
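The control flow of the monitoring loop above can be exercised without a broker. The sketch below is illustrative only: health_check and restart_broker are hypothetical stand-ins for "qpid_ping --timeout 5" and "pkill qpidd; qpidd -d", and the BROKER_OK flag simulates broker state.

```shell
#!/bin/sh
# Self-contained sketch of the monitoring loop, with the qpid_ping
# call and the restart action replaced by hypothetical stand-ins so
# the control flow can be run without a broker.
BROKER_OK=yes
health_check()   { [ "$BROKER_OK" = yes ]; }    # stands in for: qpid_ping --timeout 5
restart_broker() { echo "restarting qpidd"; BROKER_OK=yes; }  # stands in for: pkill qpidd; qpidd -d

monitor_once() { health_check || restart_broker; }

monitor_once          # broker healthy: nothing happens
BROKER_OK=no          # simulate a hung broker
monitor_once          # prints: restarting qpidd
```

In the real workaround the loop body simply runs every 5 minutes; the sketch calls it twice to show both the healthy and the hung path.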
From sdake:

Ok, so after thinking about this more, I have come to the conclusion that what you really want is an availability manager. There are currently 3 choices:

rgmanager - Simple; requires integration of rgmanager as a dependency and development of some scripts to monitor the status of the daemon. Works multi-node. Lon can provide more details here. Long term I think rgmanager will be replaced by Pacemaker. Available in RHEL5/RHEL6.

AMF - Provides a well-specified programmatic interface for availability management. Works multi-node. The current implementation is not suitable for deployment and won't be for a long time.

Pacemaker - Very complex to set up, but also allows very complex configurations. Works multi-node. May be (or is) overkill for your situation. Likely the standard choice for future RHEL. Won't be available in RHEL5.

?? - The component we are missing from openais or corosync has the following (very narrow) use case: application health-checking and application restart for a replicating state-machine process. The features are:
1) simple - drop in 1 config file per process and the service is started/stopped automatically
2) programmatic C interface for executing health checks and requesting restart
3) integrated into openais or corosync in a way that allows simple single-process restart
4) no clustering concepts integrated into this service; the basic idea is to provide dead-simple process restart for CPG replication applications
5) wide distribution without dependency issues

I am not sure if we need a 4th availability manager, but the concept of a simple, non-cluster-aware failure detector and restart mechanism for replicating processes, to accompany CPG application design, is appealing for very simple use cases (and for targeting open source and startup opportunities).
See bug 515026 for another way that an unresponsive broker can hang the cluster.
*** Bug 515026 has been marked as a duplicate of this bug. ***
A simple watchdog feature for qpidd in a cluster has been committed at revision 508675.

Watchdog feature to remove unresponsive cluster nodes: in some instances (e.g. while resolving an error) it is possible for a hung process to hang the entire cluster as the other members wait for its response. The cluster can handle terminated processes, but hung processes present a problem. If the watchdog plugin is loaded and --watchdog-interval is set, the broker forks a child process that runs a very simple watchdog program, and starts a timer in the broker process to signal the watchdog every interval/2 seconds. The watchdog kills its parent if it does not receive a signal for interval seconds. This allows a stuck broker to be removed from the cluster so the other cluster members can continue.
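The mechanism described above can be sketched in shell. This is an illustrative stand-in, not the actual qpidd_watchdog source (which is a small C program); using SIGUSR1 as the heartbeat and SIGKILL as the kill signal are assumptions of the sketch, as is passing the broker pid explicitly.

```shell
#!/bin/sh
# Illustrative sketch of the watchdog child's logic: kill the broker
# if no heartbeat signal arrives within a given interval.
# (Hypothetical stand-in for qpidd_watchdog; signal choices assumed.)
watchdog_loop() {
    interval=$1
    broker=$2
    trap 'got_heartbeat=1' USR1   # broker would signal us every interval/2 seconds
    while :; do
        got_heartbeat=0
        sleep "$interval" &
        wait $!                   # interruptible, so the trap can fire mid-sleep
        if [ "$got_heartbeat" -eq 0 ]; then
            kill -9 "$broker"     # no heartbeat: remove the hung broker
            return 0
        fi
    done
}
```

One relevant property of SIGKILL, if that is indeed the signal used: it cannot be caught, blocked, or held pending by a stopped process, so it removes even a SIGSTOPped broker immediately, which is consistent with the daemon test transcripts later in this report.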
Created attachment 357070 [details]
watchdog source code

Builds and runs with qpidd-devel-0.5.752581-26.el5
The watchdog feature was added to the product in SVN r802927
Reproduced on qpidd-cluster-0.5.752581-34.el5 (current stable)
The current watchdog actually does not help when the broker is STOPped by CTRL-Z (or kill -STOP). The stopped node is removed from the cluster only when the process is resumed and receives the signal from the watchdog to shut down. Is this expected? If yes, I do not know how to test it. qpid-cpp-server-cluster-0.7.946106-8.el5
The problem is that while the broker (node) does not leave the cluster, the same hang appears as on the version without the watchdog. Can this be fixed so that the watchdog also removes the process when it is in the stopped state?
This is working for me:

    [aconway@mrg32 ~]$ qpidd --watchdog-interval=5 --daemon
    [aconway@mrg32 ~]$ pgrep -lf qpidd
    21061 qpidd --watchdog-interval=5 --daemon
    21063 /home/remote/aconway/install/libexec/qpid/qpidd_watchdog 5
    [aconway@mrg32 ~]$ kill -stop 21061; while pgrep qpidd; do sleep 1; done;
    21061
    21063
    21061
    21063
    21061
    21063
    [aconway@mrg32 ~]$ pgrep qpidd
    [aconway@mrg32 ~]$

It also works with a broker in a cluster. How are you testing this?
Aah, I haven't tried running it as a daemon. I ran it in the foreground and suspended it with CTRL-Z in Bash. Will try this today. Thanks for the info.
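A likely explanation for the difference (an inference, not stated explicitly in this thread): CTRL-Z sends SIGTSTP to the entire foreground process group, so a foreground broker's watchdog child is suspended along with it, whereas kill -STOP <pid> stops only that one process and leaves the watchdog child running. A small shell illustration of stopping one process out of two:

```shell
#!/bin/sh
# Stop one of two processes with kill -STOP and observe that the
# other keeps running -- unlike CTRL-Z, which suspends the whole
# foreground process group.  The sleeps are stand-ins.
sleep 60 & broker=$!      # stands in for the broker
sleep 60 & watchdog=$!    # stands in for the watchdog child

kill -STOP "$broker"
broker_state=$(ps -o stat= -p "$broker" | tr -d ' ')
watchdog_state=$(ps -o stat= -p "$watchdog" | tr -d ' ')
echo "broker=$broker_state watchdog=$watchdog_state"   # e.g. broker=T watchdog=S

kill -CONT "$broker"
kill "$broker" "$watchdog"
```

This matches the observed behavior: with --daemon the watchdog child survives the kill -STOP and removes the stopped broker; in a suspended foreground job neither process runs, so nothing can fire.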
It works as expected. Tested on RHEL5 i386 and x86_64, qpid-cpp-server-cluster-0.7.946106-8.el5.

    # qpidd --daemon --auth=no --cluster-name=ahoj --watchdog-interval 5
    # qpidd --daemon --auth=no --cluster-name=ahoj --data-dir /tmp/qpidd2 -p 12345 --watchdog-interval 5
    # qpid-cluster
      Cluster Name: ahoj
      Cluster Status: ACTIVE
      Cluster Size: 2
      Members: ID=10.34.26.26:3816 URL=amqp:tcp:10.34.26.26:12345
             : ID=10.34.26.26:3824 URL=amqp:tcp:10.34.26.26:5672
    # kill -STOP <PID>
    # sleep 5
    # qpid-cluster
      Cluster Name: ahoj
      Cluster Status: ACTIVE
      Cluster Size: 1
      Members: ID=10.34.26.26:3824 URL=amqp:tcp:10.34.26.26:5672
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, when a clustered broker stopped responding, it could cause the entire cluster to stop responding as well. To prevent this, a watchdog mechanism has been introduced to detect and, after a configurable timeout, kill an unresponsive qpidd process.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html