488072 – [RFE] check for crashed clurgmgrd process

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	557292

Reported:	2009-03-02 14:28 UTC by Carsten Clasohm
Modified:	2018-10-20 03:29 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-21 19:57:32 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments:
script to use as a cluster service for testing (238 bytes, application/x-sh) 2009-03-02 14:28 UTC, Carsten Clasohm
clustat output and syslog from node which starts second instance of service:sleep (2.50 KB, text/plain) 2009-03-02 16:29 UTC, Carsten Clasohm

Description Carsten Clasohm 2009-03-02 14:28:01 UTC

Created attachment 333737 [details]
script to use as a cluster service for testing

Description of problem:

When the clurgmgrd process disappears, for example because it was selected by the Out of Memory killer, cluster services continue to run on node A. When the rgmanager service is started on node B afterwards, node B starts another instance of all services which are already running on node A. In the case of virtual machines, this leads to data loss.

We lost our Satellite this way, after HP monitoring software triggered the Out of Memory killer in the Dom0 of one node and killed the clurgmgrd process.

Version-Release number of selected component (if applicable):

rgmanager-2.0.46-1.el5

How reproducible:

always

Steps to Reproduce:
1. define a cluster service, like the attached sleep.sh

<service autostart="1" name="sleep">
  <script file="/usr/local/bin/sleep.sh" name="sleep"/>
</service>

2. start this service on node A

3. "kill -9" the two clurgmgrd processes on node A

4. reboot node B, or just restart the rgmanager system service
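
For anyone without access to the attachment, a test service script could look roughly like the sketch below. This is a hypothetical stand-in, not the attached sleep.sh: the pid file path, the `svc` and `is_running` helper names, and the one-day sleep are all assumptions. It only relies on rgmanager calling a script resource with start, stop, or status, as it would an init script.

```shell
#!/bin/sh
# Hypothetical stand-in for a test service script; NOT the attached sleep.sh.
# The pid file location is an assumption and can be overridden.
PIDFILE=${PIDFILE:-/var/run/sleep-test.pid}

# Succeeds only if the pid file names a process that is still alive.
is_running() {
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

svc() {
    case "$1" in
        start)
            # Already running counts as success.
            is_running && return 0
            sleep 86400 &
            echo $! > "$PIDFILE"
            ;;
        stop)
            if is_running; then
                kill "$(cat "$PIDFILE")"
            fi
            rm -f "$PIDFILE"
            ;;
        status)
            # Exit status 0 = running, non-zero = not running.
            is_running
            ;;
        *)
            echo "usage: $0 {start|stop|status}" >&2
            return 2
            ;;
    esac
}

# Dispatch only when an action was given on the command line.
if [ $# -gt 0 ]; then
    svc "$@"
fi
```

With a script like this on both nodes, running its status action on node A and node B after step 4 shows whether the service is active in two places at once.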

Actual results:

The "sleep" script runs on both nodes at the same time.

Expected results:

When rgmanager on node B detects that a service is running on node A, but rgmanager is not, that service should be marked as "failed", preventing other nodes from starting it. Additionally, node A could try to recover by restarting rgmanager or by rebooting.

Comment 1 Carsten Clasohm 2009-03-02 16:29:11 UTC

Created attachment 333758 [details]
clustat output and syslog from node which starts second instance of service:sleep

Comment 2 Lon Hohberger 2009-04-01 21:28:02 UTC

The lower-numbered PID is not going to get killed by the OOM killer; it never allocates memory once running.  It sits in wait() for the higher-numbered PID (the actual main process of rgmanager) to exit, and reboots the machine if the child process exits unexpectedly using a fatal signal.

Now, the watchdog process, as we call it, will exit if the child was terminated with SIGKILL (i.e. admin intervention).  This is expected behavior.  However, other fatal signals (SIGILL, SIGFPE, SIGSEGV, etc.) will cause the first PID (lower #'d PID) to reboot the machine.  So, if rgmanager runs into internal memory corruption or something, the machine will reboot so failover is safe at that point.

Unfortunately for us, the OOM killer uses SIGKILL, so we either need a way to distinguish whether the child was killed via OOM (reboot) or we need to always reboot if the child has been killed with SIGKILL.
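
The decision described above can be illustrated in shell. This is only a sketch of the logic, not rgmanager's code: the function names `classify_exit` and `watch_child` are made up, the reboot is replaced by an echo, and every signal other than SIGKILL is treated as fatal for simplicity. It relies on the shell convention that `wait` reports 128 plus the signal number for a child killed by a signal.

```shell
#!/bin/sh
# Hypothetical sketch of the watchdog decision; not rgmanager's real code.

# Map a status returned by `wait` to the action described in the comment.
classify_exit() {
    status=$1
    if [ "$status" -le 128 ]; then
        # Child exited on its own: the watchdog simply exits too.
        echo "exit"
        return 0
    fi
    case $((status - 128)) in
        9) echo "exit" ;;     # SIGKILL: assumed to be admin intervention
        *) echo "reboot" ;;   # SIGSEGV, SIGILL, SIGFPE, ...: reboot so failover is safe
    esac
}

# Run a command as the child, wait for it, and print the decision.
watch_child() {
    "$@" &
    wait $!
    classify_exit $?
}

# Demo: a child that kills itself with SIGSEGV.
watch_child sh -c 'kill -s SEGV $$'
```

The SIGKILL branch is exactly where the OOM killer case falls through: from the exit status alone, the watchdog cannot tell an administrator's kill -9 from the OOM killer.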

Comment 3 Lon Hohberger 2009-04-09 13:22:24 UTC

So, the best thing I can come up with is basically sleeping for 2-3 seconds, so that a 'killall -9' works correctly (doesn't cause a reboot) while a kill -9 of only the "real" rgmanager process (i.e. the one doing lots of memory operations and so forth) still does.

In addition, we need to mlockall(MCL_CURRENT) on the watchdog process, so that it won't get paged out.
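
A rough shell illustration of the delay idea follows. It is a sketch under assumptions, not the actual patch: `after_child_sigkill` is a made-up name, the reboot is replaced by an echo, and the mlockall() part is not shown because it has no shell equivalent.

```shell
#!/bin/sh
# Hypothetical sketch of the grace-period idea; not rgmanager's real code.

# Called by the watchdog once it learns its child died from SIGKILL.
# $1 is the grace period in seconds (2-3 in the comment above).
after_child_sigkill() {
    sleep "${1:-3}"
    # Still running after the delay: the watchdog itself was not killed,
    # so only the main daemon died and a reboot is the safe response.
    echo "reboot"
}

# Demo with a short delay; a real watchdog would reboot instead of echoing.
after_child_sigkill 1
```

If 'killall -9' hits both processes, the watchdog dies during the sleep and never reaches the reboot, which is what keeps that case working.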

Comment 4 Lon Hohberger 2009-04-09 13:23:10 UTC

Wow.  Bugzilla's formatting rocks.  I really didn't put carriage returns in that last comment.

Comment 5 Lon Hohberger 2009-05-19 20:06:37 UTC

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=d727e39ecd607b879ec3dc8841d599131ee7638d

Comment 7 Chris Ward 2009-07-03 18:26:00 UTC

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 8 Chris Verhoef 2009-08-04 07:35:38 UTC

I'm very sorry, but with the RHEL 5u4 beta packages the issue is still there. For testing I created a 2-node cluster running full RHEL 5u4 beta and the test service described earlier in this bugzilla. Killing the two clurgmgrd processes on the node where the service is running and then restarting rgmanager on the other node ends up with the service running on both nodes and clustat reporting the service as stopped.

Comment 11 Chris Ward 2009-08-04 13:59:01 UTC

Thank you for your testing feedback. I'm also sorry to hear that the issue was not resolved as expected.

Unfortunately, due to the fact that this failure was found so late in the release cycle, I believe we're going to have to defer the fix to RHEL 5.5, unless there is a strong business case for it. 

Please state your opinion on this matter. Thank you.

Comment 12 Chris Verhoef 2009-08-05 09:04:13 UTC

Well, because we only use VMs within our test environment and we have good backups, this could wait till 5.5, but if this could be fixed earlier as an erratum then please do.

Customers who run VMs for production may have a more urgent need for the fix, because it's possible for a VM to run more than once within the cluster, and this will break the local filesystems of the VM, as it did with our RHN Satellite VM within the test environment.

Comment 15 Chris Ward 2009-08-06 08:21:57 UTC

Our engineering team has determined that they'll be able to better address this issue in RHEL 5.5. Therefore, this issue will unfortunately remain unaddressed in 5.4.0.

Comment 16 Lon Hohberger 2009-08-10 20:35:31 UTC

Perhaps it is important to note:

  killall -9 clurgmgrd          # will never work

  kill -9 `pidof -s clurgmgrd`  # works fine on 5.4 beta

The first test also kills the monitoring process responsible for rebooting the node if the main clurgmgrd process dies.  One can not expect the monitoring process to perform its function if it is dead.

If this is inadequate, additional process monitoring can be provided by the watchdog package.
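
To target only the main daemon in a test without guessing which PID is which, the parent/child relationship from Comment 2 can be used: the main process is the child of the monitoring process. A sketch, assuming the procps `ps -C` option is available; the helper name `main_daemon_pid` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads "PID PPID" pairs on stdin and prints every PID
# whose parent is also in the list, i.e. the child of the monitoring process.
main_daemon_pid() {
    awk '{ parent[$1] = $2 }
         END { for (p in parent) if (parent[p] in parent) print p }'
}

# List both clurgmgrd processes with their parents and pick out the child.
# Prints nothing if clurgmgrd is not running.
ps -C clurgmgrd -o pid= -o ppid= | main_daemon_pid
```

The printed PID is the one to pass to kill -9 when testing the built-in monitoring; the other clurgmgrd PID is the monitoring process and should be left alone.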

Comment 18 Lon Hohberger 2009-09-21 20:00:41 UTC

Updated solutions.


SOLUTION #1 (using rgmanager's built in process monitoring):

(1) Start rgmanager normally.

(2) Test by running:

    kill -9 `pidof -s clurgmgrd`

You should see:

    Sep 21 15:55:38 east-04 clurgmgrd[4183]: <crit> Watchdog: Daemon died, rebooting...



SOLUTION #2 (test using 'killall'):

(1) Edit (or create, if not already existing) /etc/sysconfig/cluster.  Add the
following line:

    RGMGR_OPTS="-w"

(2) Install the 'watchdog' package:

    yum install -y watchdog   -or-

    up2date watchdog

(3) Edit /etc/modprobe.conf and add an appropriate watchdog device for your
system.  Example:

    alias watchdog my_device

If you do not have an appropriate device or do not know what device you have
available, add the following:

    alias watchdog softdog

(4) Trick the watchdog init script to load the watchdog driver on start by
adding the following to /etc/sysconfig/watchdog:

    # Trick to load the right module
    modprobe watchdog

(5) Create a monitoring script to check for rgmanager's viability.  For
example:

    #!/bin/bash
    /sbin/service rgmanager status

    ret=$?

    if [ $ret -eq 0 ]; then
        # running = OK
        exit 0
    elif [ $ret -eq 3 ]; then 
        # cleanly stopped = OK
        exit 0
    fi

    exit 1

(6) Create /etc/watchdog.conf with the following template:

    watchdog-device = /dev/watchdog
    realtime = yes
    priority = 1
    #
    # point test-binary at your monitoring script in step 5
    #
    test-binary = /root/rgmanager-test

(7) Test your configuration.

    service rgmanager start
    service watchdog start
    killall -9 clurgmgrd

You should see:

    Sep 21 15:46:46 east-04 watchdog[16759]: test=/root/rgmanager-test(0) repair=none alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
    Sep 21 15:47:11 east-04 watchdog[16759]: test binary returned 1
    Sep 21 15:47:11 east-04 watchdog[16759]: shutting down the system because of error 1

(8) *IF* you have rgmanager set to start at boot time, then you may enable the
watchdog daemon startup at boot time using chkconfig.

    chkconfig --add watchdog
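
Before running the killall test in step 7, it may help to confirm that the test-binary line in the watchdog configuration points at a script that exists and is executable. The following is a sketch only; the function names and the WATCHDOG_CONF variable are made up for illustration and are not part of the watchdog package.

```shell
#!/bin/sh
# Hypothetical sanity check for the setup above; not part of the watchdog package.

# Print the path configured as test-binary in a watchdog.conf-style file
# (the last such line wins if there are several).
test_binary_path() {
    awk '/^[ \t]*test-binary[ \t]*=/ {
             sub(/^[ \t]*test-binary[ \t]*=[ \t]*/, "")
             sub(/[ \t]+$/, "")
             print
         }' "$1" | tail -n 1
}

# Report whether the configured test binary is present and executable.
check_test_binary() {
    path=$(test_binary_path "$1")
    if [ -z "$path" ]; then
        echo "no test-binary set"
        return 1
    fi
    if [ ! -x "$path" ]; then
        echo "$path is not executable"
        return 1
    fi
    echo "ok: $path"
}

# WATCHDOG_CONF is an assumed override; the default matches step 6.
conf=${WATCHDOG_CONF:-/etc/watchdog.conf}
if [ -f "$conf" ]; then
    check_test_binary "$conf"
fi
```

A script that was never made executable (chmod +x) is an easy mistake to make in step 5, and this check reports it before the killall test is attempted.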

Comment 20 Idan Shinberg 2010-06-06 12:13:30 UTC

I know this issue has been closed for almost a year now. A few questions:

1) Has this issue been dealt with in RHEL 5.5?

2) I'm using rgmanager 2.0.52 with CentOS 5.2, kernel 2.6.18-92. Though rgmanager is the newest version, I'm still experiencing these issues, requiring me to reboot my server each time: in cases where I send SIGKILL to aisexec or clurgmgrd, the dlm threads survive, leaving clurgmgrd in a defunct state. Will upgrading to a newer version help me solve anything?

Comment 21 Perry Myers 2010-06-11 15:28:31 UTC

1) Yes, Lon indicated in Comment #18 some possible solutions that will work on RHEL 5.5

2) CentOS is not a Red Hat product, but we welcome bug reports on Red Hat products here in our public bugzilla database. Also, if you would like technical support, please log in at support.redhat.com or visit www.redhat.com (or call us!) for information on subscription offerings to suit your needs.
