Red Hat Bugzilla – Bug 488072
[RFE] check for crashed clurgmgrd process
Last modified: 2016-04-26 11:29:47 EDT
Created attachment 333737 [details]
script to use as a cluster service for testing
Description of problem:
When the clurgmgrd process disappears, for example because it was selected by the Out of Memory killer, cluster services continue to run on node A. When the rgmanager service is started on node B afterwards, node B starts another instance of all services which are already running on node A. In the case of virtual machines, this leads to data loss.
We lost our Satellite this way, after HP monitoring software triggered the Out of Memory killer in the Dom0 of one node and killed the clurgmgrd process.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. define a cluster service, like the attached sleep.sh
<service autostart="1" name="sleep">
<script file="/usr/local/bin/sleep.sh" name="sleep"/>
</service>
2. start this service on node A
3. "kill -9" the two clurgmgrd processes on node A
4. reboot node B, or just restart the rgmanager system service
The "sleep" script runs on both nodes at the same time.
When rgmanager on node B detects that a service is running on node A, but rgmanager is not, that service should be marked as "failed", preventing other nodes from starting it. Additionally, node A could try to recover by restarting rgmanager or by rebooting.
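For readers without access to the attachment, a test service of this kind could look roughly like the sketch below. This is not the attached sleep.sh; it is a hypothetical stand-in that assumes the script resource is called with start, stop, or status in the style of an init script. The PIDFILE path and the sleep_service function name are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for the attached sleep.sh (the attachment is not shown here).
# PIDFILE location is an assumption; override it via the environment.
PIDFILE=${PIDFILE:-/var/run/sleep-test.pid}

sleep_service() {
    case "$1" in
    start)
        # Launch a long-running placeholder process and remember its PID.
        sleep 86400 >/dev/null 2>&1 &
        echo $! > "$PIDFILE"
        ;;
    stop)
        # Kill the recorded process, if any, and remove the PID file.
        if [ -f "$PIDFILE" ]; then
            kill "$(cat "$PIDFILE")" 2>/dev/null
            rm -f "$PIDFILE"
        fi
        return 0
        ;;
    status)
        # 0 = running, 3 = not running (LSB init script convention).
        if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
            return 0
        fi
        return 3
        ;;
    *)
        echo "usage: $0 {start|stop|status}" >&2
        return 2
        ;;
    esac
}

# Dispatch only when an action was passed on the command line.
if [ $# -gt 0 ]; then
    sleep_service "$1"
    exit $?
fi
```

Running "ps ax | grep sleep" on both nodes after step 4 is then enough to see whether the service has been started twice.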
Created attachment 333758 [details]
clustat output and syslog from node which starts second instance of service:sleep
The lower-numbered PID is not going to get killed by the OOM killer; it never allocates memory once running. It sits in wait() for the higher-numbered PID (the actual main process of rgmanager) to exit, and reboots the machine if the child process is terminated unexpectedly by a fatal signal.
Now, the watchdog process as we call it will exit if the child was terminated with SIGKILL (i.e. admin intervention). This is expected behavior. However, other fatal signals (SIGILL, SIGFPE, SIGSEGV, etc.) will cause the first PID (lower #'d PID) to reboot the machine. So, if rgmanager runs into internal memory corruption or something, the machine will reboot so failover is safe at that point.
Unfortunately for us, the OOM killer uses SIGKILL, so we either need a way to distinguish whether the child was killed via OOM (reboot) or we need to just reboot if the child has been killed with SIGKILL always.
So, the best thing I can come up with is basically sleeping for 2-3 seconds in the watchdog before rebooting, so that a 'killall -9' works correctly (doesn't cause a reboot), while a 'kill -9' of the "real" rgmanager process (i.e. the one doing lots of memory operations and so forth) does.
In addition, we need to mlockall(MCL_CURRENT) on the watchdog process, so that it won't get paged out.
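The decision described in the comments above can be illustrated with a small shell sketch. It is only an illustration: the real watchdog is part of rgmanager and uses wait() directly, and the function name watchdog_action and the action labels are made up here. The sketch relies on bash and dash reporting a child killed by signal N as status 128+N, and it simplifies by treating every signal other than SIGKILL as fatal.

```shell
#!/bin/sh
# Map a shell wait status to the watchdog action discussed above.
# bash and dash report "killed by signal N" as 128+N, so SIGKILL (9) is 137.
watchdog_action() {
    status=$1
    if [ "$status" -lt 128 ]; then
        echo "exit"                 # child exited on its own
    elif [ "$status" -eq 137 ]; then
        echo "delay-then-reboot"    # SIGKILL: admin 'killall -9' or OOM killer
    else
        echo "reboot"               # SIGSEGV, SIGILL, SIGFPE, ...
    fi
}

# Demonstration: kill a placeholder child with SIGKILL and inspect its status.
sleep 30 &
child=$!
kill -9 "$child"
status=0
wait "$child" || status=$?
echo "child status $status -> $(watchdog_action "$status")"
```

With the proposed delay, a 'killall -9' also reaches the watchdog during the 2-3 second sleep, so no reboot happens, whereas a SIGKILL aimed only at the child (as the OOM killer would send) leaves the watchdog alive to reboot the node.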
Wow. Bugzilla's formatting rocks. I really didn't put carriage returns in that last comment.
~~ Attention - RHEL 5.4 Beta Released! ~~
RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!
If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.
Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.
Questions can be posted to this bug or your customer or partner representative.
I'm very sorry, but with the RHEL 5u4 beta packages the issue is still there. For testing I created a 2-node cluster running the full RHEL 5u4 beta and the test service described earlier in this bugzilla. Killing the two clurgmgrd processes on the node where the service is running and then restarting rgmanager on the other node ends up with the service running on both nodes and clustat reporting the service as stopped.
Thank you for your testing feedback. I'm sorry to hear that the issue was not resolved as expected.
Unfortunately, due to the fact that this failure was found so late in the release cycle, I believe we're going to have to defer the fix to RHEL 5.5, unless there is a strong business case for it.
Please state your opinion on this matter. Thank you.
Well, because we only use VMs within our test environment and we have good backups, this could wait until 5.5, but if it could be fixed earlier as an erratum then please do.
Customers who run VMs in production may have a more urgent need for the fix, because a VM can end up running more than once within the cluster, which will corrupt the VM's local filesystems, as happened with our RHN Satellite VM in the test environment.
Our engineering team has determined that they'll be able to better address this issue in RHEL 5.5. Therefore, this issue will unfortunately remain unaddressed in 5.4.0.
Perhaps it is important to note:
killall -9 clurgmgrd # will never work
kill -9 `pidof -s clurgmgrd` # works fine on 5.4 beta
The first test also kills the monitoring process responsible for rebooting the node if the main clurgmgrd process dies. One cannot expect the monitoring process to perform its function if it is dead.
If this is inadequate, additional process monitoring can be provided by the watchdog package.
SOLUTION #1 (using rgmanager's built in process monitoring):
(1) Start rgmanager normally.
(2) Test by running:
kill -9 `pidof -s clurgmgrd`
You should see:
Sep 21 15:55:38 east-04 clurgmgrd: <crit> Watchdog: Daemon died, rebooting...
SOLUTION #2 (test using 'killall'):
(1) Edit (or create, if not already existing) /etc/sysconfig/cluster. Add the
(2) Install the 'watchdog' package:
yum install -y watchdog -or-
(3) Edit /etc/modprobe.conf and add an appropriate watchdog device for your
alias watchdog my_device
If you do not have an appropriate device or do not know what device you have
available, add the following:
alias watchdog softdog
(4) Trick the watchdog init script to load the watchdog driver on start by
adding the following to /etc/sysconfig/watchdog:
# Trick to load the right module
(5) Create a monitoring script to check for rgmanager's viability. For
#!/bin/sh
/sbin/service rgmanager status
ret=$?
if [ $ret -eq 0 ]; then
    exit 0    # running = OK
elif [ $ret -eq 3 ]; then
    exit 0    # cleanly stopped = OK
fi
exit 1        # anything else = not OK
(6) Create /etc/watchdog.conf with the following template:
watchdog-device = /dev/watchdog
realtime = yes
priority = 1
# point test-binary at your monitoring script in step 5
test-binary = /root/rgmanager-test
(7) Test your configuration.
service rgmanager start
service watchdog start
killall -9 clurgmgrd
You should see:
Sep 21 15:46:46 east-04 watchdog: test=/root/rgmanager-test(0) repair=none alive=/dev/watchdog heartbeat=none temp=none to=root no_act=no
Sep 21 15:47:11 east-04 watchdog: test binary returned 1
Sep 21 15:47:11 east-04 watchdog: shutting down the system because of error 1
(8) *IF* you have rgmanager set to start at boot time, then you may enable the
watchdog daemon startup at boot time using chkconfig.
chkconfig --add watchdog
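Before pointing the watchdog daemon at a monitoring script, it can be useful to dry-run the status-code handling from step 5 without a running cluster. The sketch below is one way to do that. STATUS_CMD and rgmanager_ok are hypothetical names introduced only for this example; the exit-code mapping (0 and 3 are healthy, anything else is a failure) follows step 5 and the "test binary returned 1" log line above.

```shell
#!/bin/sh
# STATUS_CMD is a hypothetical override so the logic can be tried without a cluster.
STATUS_CMD=${STATUS_CMD:-"/sbin/service rgmanager status"}

# Returns 0 if rgmanager is running or cleanly stopped, 1 otherwise.
rgmanager_ok() {
    ret=0
    $STATUS_CMD >/dev/null 2>&1 || ret=$?
    case "$ret" in
    0|3) return 0 ;;   # running, or cleanly stopped
    *)   return 1 ;;   # dead, unknown, ...
    esac
}
```

For a dry run, set STATUS_CMD=true or STATUS_CMD=false and check the return value of rgmanager_ok. A script used as test-binary would then end with "rgmanager_ok; exit $?".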
I know this issue has been closed for almost a year now. A few questions:
1) Has this issue been dealt with in RHEL 5.5?
2) I'm using rgmanager 2.0.52 with CentOS 5.2, kernel 2.6.18-92. Though rgmanager is the newest version, I'm still experiencing these issues, requiring me to reboot my server each time. In cases where I send SIGKILL to aisexec or clurgmgrd, the dlm threads survive, leaving clurgmgrd in a defunct state. Will upgrading to a newer version help me solve anything?
1) Yes, Lon indicated in Comment #18 some possible solutions that will work on RHEL 5.5
2) CentOS is not a Red Hat product, but we welcome bug reports on Red Hat products here in our public bugzilla database. Also, if you would like technical support please login at support.redhat.com or visit www.redhat.com (or call us!) for information on subscription offerings to suit your needs.