227823 – clurgmgrd[6673]: <crit> Watchdog: Daemon died, rebooting...

Summary: clurgmgrd[6673]: <crit> Watchdog: Daemon died, rebooting...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	clumanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-02-08 13:39 UTC by Tomasz Jaszowski
Modified:	2009-04-16 20:22 UTC
CC List:	4 users
Fixed In Version:	RHBA-2007-0133
Clone Of:
Environment:
Last Closed:	2007-05-16 15:47:26 UTC
Embargoed:

Description Tomasz Jaszowski 2007-02-08 13:39:46 UTC

Description of problem:
Sometimes, while stopping cluster services, I receive: clurgmgrd[6673]: <crit>
Watchdog: Daemon died, rebooting...

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
reboot

Expected results:
I would like more information about why this happened. I can't find any
information about this watchdog.



Additional info:

Comment 1 Lon Hohberger 2007-02-08 15:23:44 UTC

That happens if rgmanager crashes.  There are a few crash-fixes coming in the
next update.

It could theoretically also happen if rgmanager isn't down and cman tells it to
die (e.g. when running cman_tool leave force ...).
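
To see how often the watchdog has fired and what was logged just before, the system log can be searched for the message quoted in the report. This is a minimal sketch, not part of rgmanager: the helper names are hypothetical, and the default path /var/log/messages is an assumption about where syslog writes on the affected nodes.

```shell
#!/bin/bash
# Hypothetical helpers; the default log path is an assumption.

# Print the number of log lines that contain the watchdog message.
count_watchdog_events() {
    local logfile="${1:-/var/log/messages}"
    grep -c 'Watchdog: Daemon died' "$logfile"
}

# Print each watchdog message together with the lines logged just before it.
show_watchdog_context() {
    local logfile="${1:-/var/log/messages}" lines="${2:-20}"
    grep -B "$lines" 'Watchdog: Daemon died' "$logfile"
}
```

For example, `show_watchdog_context /var/log/messages 50` prints up to 50 lines of context before each occurrence, which may show whether a service stop or a cman event preceded the reboot.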

Comment 2 Lon Hohberger 2007-02-08 15:29:41 UTC

Try starting rgmanager with:

ulimit -c unlimited
clurgmgrd -d

That will disable the watchdog.  Additionally, if rgmanager crashes on the way
down, it will produce a core file.  I need the core file and what version of
rgmanager you're using as well as processor architecture in order to debug this.

(The core file is most important)
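
The steps above can be wrapped in a small script so the requested details are collected in one go. This is only a sketch: the function names are hypothetical, and it assumes rpm and uname are available and that clurgmgrd is on the PATH. Where the core file ends up depends on the system's core dump settings, which are not covered here.

```shell
#!/bin/bash
# Hypothetical helpers, not part of rgmanager.

# Print the rgmanager package version and the processor architecture.
report_rgmanager_info() {
    local ver
    # rpm -q exits non-zero if the package (or rpm itself) is missing.
    ver=$(rpm -q rgmanager 2>/dev/null) || ver="unknown"
    echo "version: $ver"
    echo "arch: $(uname -m)"
}

# Start rgmanager so that a crash can leave a core file behind.
start_rgmanager_debug() {
    ulimit -c unlimited   # lift the core file size limit for this shell
    clurgmgrd -d          # debug mode; per the comment above, disables the watchdog
}
```

Running `report_rgmanager_info` prints the two values to attach to the bug alongside the core file; `start_rgmanager_debug` has to be run as root on the cluster node.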

Comment 3 Tomasz Jaszowski 2007-03-01 06:00:52 UTC

Unfortunately, I have not yet been able to reproduce this problem in a
controlled environment, but I am still trying.

Could you give me more information about the clurgmgrd parameters? What exactly
does the -d option mean? Are there any other options available?

Comment 4 Lon Hohberger 2007-03-05 15:54:53 UTC

-d turns on debugging and disables the internal self-monitoring "watchdog" daemon.

There aren't any other helpful options in this case.

Comment 5 Michael Hagmann 2007-03-15 12:47:21 UTC

Hi

This is a really serious bug. We now have at least 5 production clusters that
have hit it. The problem is that when we enable debug mode, it doesn't happen
again!

Also, the problem occurs more often during or after disabling a service with
"clusvcadm -d ...".

That has happened at least three times.

Mike

Comment 6 Lon Hohberger 2007-03-15 14:47:48 UTC

Ok, I at *least* need the version of rgmanager you guys are using.

Comment 7 Michael Hagmann 2007-03-15 16:09:35 UTC

Hi

No problem, I can send you any information you like.

This is a normal RHEL4 AS u4 / Cluster / GFS installation.

root@lilr622b:~# rpm -qa | grep rgmanager
rgmanager-1.9.54-1

Mike

Comment 8 Tomasz Jaszowski 2007-03-15 16:22:44 UTC

Hi, we have exactly the same system version, and we still aren't able to
reproduce the problem in a controlled environment (it just dies when no one is
watching :) )

We are testing rgmanager-1.9.54-3.228823test now; if the problem occurs, I'll
pass on some info.

Tomek

Comment 9 Tomasz Jaszowski 2007-03-15 16:27:04 UTC

(In reply to comment #8)
> we are testing rgmanager-1.9.54-3.228823test now, if problem occurs I'll pass
> some info

PS. On production we have rgmanager-1.9.54-1, and rgmanager-1.9.54-3.228823test
on an identically configured test environment.

Comment 10 Michael Hagmann 2007-03-15 16:39:50 UTC

Hi, I will try extensive testing with clusvcadm -d ??? and clusvcadm -e ???;
maybe it works and I get a crash.

# Repeatedly disable and re-enable the service to try to provoke the crash.
for i in $(seq 1 1000); do
    clusvcadm -d "$SERVICENAME"
    sleep 10
    clusvcadm -e "$SERVICENAME"
    sleep 10
done

regards mike
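
A variant of the loop above that stops as soon as the daemon disappears makes it easier to note which cycle triggered the crash. This is a sketch only: stress_service is a hypothetical helper, it assumes pgrep is installed, and the check is only useful when clurgmgrd was started with -d, since otherwise the watchdog reboots the node before the loop can report anything.

```shell
#!/bin/bash
# Hypothetical helper, not part of rgmanager.
# Usage: stress_service SERVICE [CYCLES] [PAUSE_SECONDS]
stress_service() {
    local service="$1" cycles="${2:-1000}" pause="${3:-10}"
    local i
    for i in $(seq 1 "$cycles"); do
        clusvcadm -d "$service"   # disable the service
        sleep "$pause"
        clusvcadm -e "$service"   # enable it again
        sleep "$pause"
        # Stop once the daemon is gone so the failing cycle is recorded.
        if ! pgrep -x clurgmgrd >/dev/null; then
            echo "clurgmgrd not running after cycle $i"
            return 1
        fi
    done
    echo "completed $cycles cycles"
}
```

Calling `stress_service "$SERVICENAME"` uses the same 1000 cycles and 10 second pauses as the original loop.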

Comment 11 Lon Hohberger 2007-03-16 17:29:40 UTC

The watchdog fires when the daemon crashes, presumably due to a segmentation
fault.  The 3.228823test package has fixes for two bugs that, if left unfixed,
could cause this behavior.

Tomasz - in response to comment #9 - the configuration for the .54-1 and
.54-3.228823 packages should be identical; there are no backwards-compatibility
issues there.

Michael - in response to comment #10 - that loop will eventually cause a crash
on .54 due to a race; the race is fixed in .54-3.228823 and the update 5 beta
packages.

Could I get everyone who is on this bugzilla who is not already using
.54-3.228823 to use it?  I have a very strong suspicion that the crash causing
this symptom is fixed already.  All of the fixes in .54-3.228823 are included in
update 5.

If you need a different architecture than what is on my people page, let me know.

http://people.redhat.com/lhh/packages.html

Comment 12 Michael Hagmann 2007-03-28 19:03:26 UTC

Hi Lon

is it possible to get a Hotfix package from Red Hat Support for .54-3.228823 ?

thx mike

Comment 14 Lon Hohberger 2007-05-16 15:47:26 UTC

Sorry for the late response; this is fixed in 4.5

Comment 15 Michael Hagmann 2007-05-16 16:35:22 UTC

Hi Lon

no problem, we also received a Hotfix package from Support.

thx mike
