From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030708

Description of problem:
The 'restart count' in the 'clustat' output does not work. We have a (buggy) application that frequently crashes. The application is correctly restarted by the cluster manager, and the time of the restart is updated correctly in the 'last transition' column. Only the count is not updated.

Version-Release number of selected component (if applicable):
clumanager-1.0.26-2

How reproducible:
Always

Steps to Reproduce:
1. Have a simple cluster service running. A simple shell script like the following will be sufficient (for this example we assume the script is named 'uptime.sh'):

    #!/bin/sh
    while true ; do uptime >/tmp/uptime.log ; sleep 5 ; done

2. ps -ealf | grep uptime.sh
   kill <pid-of-uptime.sh>
3. Wait for the 'monitor interval'.
4. clustat
   The service is shown as being recently restarted (and correctly running), but 'restart count' remains zero.

Actual Results:
The service is shown as being recently restarted, but 'restart count' remains zero.

Expected Results:
The restart count should be '1' (or a higher number, depending on the number of failures).

Additional info:
Created attachment 101185: Fixes restart if check fails
Do you need a package with the above patch applied for testing?
Yes please!
http://people.redhat.com/lhh/clumanager-1.0.27-0.bz126125.unsupported.test.only.i386.rpm
http://people.redhat.com/lhh/clumanager-1.0.27-0.bz126125.unsupported.test.only.src.rpm

Let me know how it works. Note that this is a test-only rpm; don't use it in production.
Yes, it works. I checked several times and the restart count now is incremented nicely. Thanks for your fast help! When will the patch be integrated into the next 'official' update? (It is amazing that no one noticed this bug before!) I had to revert to 1.0.26 since the tests had to be done on a production server! Regards, S. Wonczak
Not sure at the moment. I'll have our support staff take a look at it and evaluate it. In the meantime, you could add a bit to your service script which records each time it is started to a log file and monitors that log file for activity over short periods of time. Or, you could also do something more intelligent using timestamps so you can tell that the service is restarting every status-check interval. For example, if the service check interval is 300 seconds (5 minutes) and the service is restarted < 600 seconds (10 minutes) later, it probably was a result of the status check failing -- send email to admin.
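The timestamp idea above could be sketched roughly as follows. This is only an illustration, not part of clumanager: the function name `check_quick_restart`, the state file, and the warning text are all hypothetical, and it assumes the function is called from the start branch of the service script with the current time in epoch seconds.

```shell
#!/bin/sh
# Hypothetical helper (not part of clumanager).
# Usage: check_quick_restart STATE_FILE NOW_EPOCH THRESHOLD_SECONDS
# Records NOW_EPOCH as the latest start time in STATE_FILE.
# Returns 0 and prints a warning if the previous recorded start was
# less than THRESHOLD_SECONDS ago; returns 1 otherwise.
check_quick_restart() {
    state_file=$1
    now=$2
    threshold=$3
    last=""

    # Read the previous start time, if one was recorded.
    if [ -r "$state_file" ]; then
        read -r last < "$state_file" || true
    fi

    # Always record the current start time for the next comparison.
    echo "$now" > "$state_file"

    # No usable previous timestamp: nothing to compare against.
    case "$last" in
        ''|*[!0-9]*) return 1 ;;
    esac

    elapsed=$((now - last))
    if [ "$elapsed" -lt "$threshold" ]; then
        echo "service restarted after only ${elapsed}s (threshold ${threshold}s)"
        return 0
    fi
    return 1
}
```

With a 300 second check interval, the start branch might call something like `check_quick_restart /tmp/uptime.sh.last "$(date +%s)" 600` and pass the printed warning to whatever notification is in place (for example, piping it to a mail command). The state file path here is just a placeholder.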
Hmmm.... A few moments ago I checked out the just-released clumanager-1.0.27-1 package. Unfortunately, the bugfix concerning the restart count is still not in. Any chance of a new release with the bugfix added?
It will go in U6.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-493.html