126125 – Restart Count (reported by clustat) does not work

Summary: Restart Count (reported by clustat) does not work

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	clumanager
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	123573 131576

Reported:	2004-06-16 12:07 UTC by Dr. Stephan Wonczak
Modified:	2007-11-30 22:06 UTC
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-12-13 21:18:53 UTC
Target Upstream Version:
Embargoed:

Attachments
Fixes restart if check fails (1014 bytes, patch) 2004-06-16 13:19 UTC, Lon Hohberger	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2004:493	0	normal	SHIPPED_LIVE	Updated clumanager package	2004-12-13 05:00:00 UTC

Description Dr. Stephan Wonczak 2004-06-16 12:07:17 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030708

Description of problem:
The 'restart count' in the 'clustat' output does not work.
We have a (buggy) application that crashes frequently. The cluster
manager restarts it correctly, and the restart time is correctly
updated in the 'last transition' column. Only the count is not
updated.


Version-Release number of selected component (if applicable):
clumanager-1.0.26-2

How reproducible:
Always

Steps to Reproduce:
1. Have a simple cluster service running. A shell script like the
   following is sufficient:

   #!/bin/sh
   while true ; do uptime >/tmp/uptime.log ; sleep 5 ; done

   For this example, assume the script is named 'uptime.sh'.

2. ps -ealf | grep uptime.sh
   kill <pid-of-uptime.sh>

3. Wait for the 'monitor interval' to elapse.

4. Run clustat.
   The service is shown as recently restarted (and running correctly),
   but the 'restart count' remains zero.

Actual Results:  The service is shown as recently restarted, but the
'restart count' remains zero.

Expected Results:  The restart count should be '1' (or higher,
depending on the number of failures).

Additional info:

Comment 1 Lon Hohberger 2004-06-16 13:19:38 UTC

Created attachment 101185
Fixes restart if check fails

Comment 2 Lon Hohberger 2004-06-16 13:22:01 UTC

Do you need a package with the above patch applied for testing?

Comment 3 Dr. Stephan Wonczak 2004-06-16 13:33:20 UTC

Yes please!

Comment 4 Lon Hohberger 2004-06-16 13:59:01 UTC

http://people.redhat.com/lhh/clumanager-1.0.27-0.bz126125.unsupported.test.only.i386.rpm
http://people.redhat.com/lhh/clumanager-1.0.27-0.bz126125.unsupported.test.only.src.rpm

Let me know how it works.  Note that this is a test-only rpm; don't
use it in production.

Comment 5 Dr. Stephan Wonczak 2004-06-16 14:28:47 UTC

Yes, it works. I checked several times, and the restart count is now
incremented correctly. Thanks for your quick help!
When will the patch be integrated into the next 'official' update? (It
is amazing that no one noticed this bug before!) I had to revert to
1.0.26, since the tests had to be done on a production server!

  Regards, S. Wonczak

Comment 6 Lon Hohberger 2004-06-16 15:29:06 UTC

Not sure at the moment.  I'll have our support staff take a look at it
and evaluate it.

In the meantime, you could add a few lines to your service script that
record each start in a log file, and then monitor that log file for
bursts of activity over short periods of time. Alternatively, you could
do something more intelligent with timestamps, so that you can tell
when the service is being restarted at every status-check interval.

For example, if the service check interval is 300 seconds (5 minutes)
and the service is restarted less than 600 seconds (10 minutes) after
its previous start, the restart was probably the result of a failed
status check; send email to the administrator.
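A minimal sketch of the workaround from Comment 6, written in POSIX sh. It is not part of clumanager: the log path, the variables `STARTS_LOG`, `CHECK_INTERVAL` and `ADMIN_EMAIL`, and the helper functions are all hypothetical. The "less than twice the check interval" threshold generalizes the 300 s / 600 s example, and the sketch assumes `mail` (mailx) is installed and that you call `on_service_start` from the start branch of your own service script.

```shell
#!/bin/sh
# Hypothetical workaround helpers (not part of clumanager).
# STARTS_LOG     - file to which each start time (epoch seconds) is appended
# CHECK_INTERVAL - the service's status-check interval, in seconds
# ADMIN_EMAIL    - recipient of the warning mail
STARTS_LOG=${STARTS_LOG:-/var/log/uptime-starts.log}
CHECK_INTERVAL=${CHECK_INTERVAL:-300}
ADMIN_EMAIL=${ADMIN_EMAIL:-root}

# record_start NOW: append NOW to the log and print the number of seconds
# since the previously recorded start (prints nothing on the first start).
record_start() {
    now=$1
    prev=""
    if [ -s "$STARTS_LOG" ]; then
        prev=$(tail -n 1 "$STARTS_LOG")
    fi
    echo "$now" >> "$STARTS_LOG"
    if [ -n "$prev" ]; then
        echo $((now - prev))
    fi
}

# is_quick_restart DELTA: succeed if DELTA is below twice the check
# interval, i.e. the restart was probably caused by a failed status check.
is_quick_restart() {
    [ -n "$1" ] && [ "$1" -lt $((2 * CHECK_INTERVAL)) ]
}

# starts_within WINDOW NOW: print how many starts were recorded in the last
# WINDOW seconds; a rough stand-in for the missing restart count.
starts_within() {
    window=$1
    now=$2
    if [ ! -f "$STARTS_LOG" ]; then
        echo 0
        return
    fi
    awk -v cutoff=$((now - window)) '$1 >= cutoff { n++ } END { print n + 0 }' "$STARTS_LOG"
}

# on_service_start: call this from the "start" branch of the service script.
# Assumes mail(1) from mailx is available.
on_service_start() {
    delta=$(record_start "$(date +%s)")
    if is_quick_restart "$delta"; then
        echo "Service restarted ${delta}s after its previous start; the status check probably failed." \
            | mail -s "cluster service restart warning" "$ADMIN_EMAIL"
    fi
}
```

Comparing consecutive start times keeps the check cheap (one `tail` per start), while `starts_within` can be run from cron to approximate the restart count over a chosen window.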

Comment 9 Dr. Stephan Wonczak 2004-08-19 09:46:00 UTC

Hmmm.... A few moments ago I checked out the just-released
clumanager-1.0.27-1 package. Unfortunately, the restart-count fix is
still not included. Any chance of a new release with the fix added?

Comment 10 Lon Hohberger 2004-08-23 13:52:19 UTC

It will go into Update 6 (U6).

Comment 12 John Flanagan 2004-12-13 21:18:53 UTC

An errata update has been issued that should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-493.html
