Bug 227823 - clurgmgrd[6673]: <crit> Watchdog: Daemon died, rebooting...
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: clumanager
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2007-02-08 08:39 EST by Tomasz Jaszowski
Modified: 2009-04-16 16:22 EDT
Fixed In Version: RHBA-2007-0133
Doc Type: Bug Fix
Last Closed: 2007-05-16 11:47:26 EDT


Description Tomasz Jaszowski 2007-02-08 08:39:46 EST
Description of problem:
Sometimes, while stopping cluster services, I receive:
clurgmgrd[6673]: <crit> Watchdog: Daemon died, rebooting...

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
The node reboots.

Expected results:
I would like more information about why this happened. I can't find any
documentation about this watchdog.



Additional info:
Comment 1 Lon Hohberger 2007-02-08 10:23:44 EST
That happens if rgmanager crashes.  There are a few crash-fixes coming in the
next update.

It could theoretically also happen if rgmanager isn't down and cman tells it to
die (e.g. running cman_tool leave force ...).
Comment 2 Lon Hohberger 2007-02-08 10:29:41 EST
Try starting rgmanager with:

ulimit -c unlimited
clurgmgrd -d

That will disable the watchdog.  Additionally, if rgmanager crashes on the way
down, it will produce a core file.  I need the core file and what version of
rgmanager you're using as well as processor architecture in order to debug this.

(The core file is most important)
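
The two commands above could be wrapped in a small launcher that confirms the core-file limit actually changed before the daemon starts. This is only a sketch: the function names and the RGMANAGER_BIN variable are hypothetical, and it assumes clurgmgrd is on the PATH. Only `ulimit -c unlimited` and `clurgmgrd -d` come from this comment.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of rgmanager.
# Assumption: the daemon binary is reachable as "clurgmgrd" on PATH.
RGMANAGER_BIN="${RGMANAGER_BIN:-clurgmgrd}"

# Succeeds only when core files are not size-limited in this shell.
core_limit_ok() {
    [ "$(ulimit -c)" = "unlimited" ]
}

# Raise the core limit, verify it, then start the daemon in debug mode.
start_rgmanager_debug() {
    ulimit -c unlimited || return 1
    if ! core_limit_ok; then
        echo "core limit is still restricted; no core file would be written" >&2
        return 1
    fi
    "$RGMANAGER_BIN" -d
}
```

If the limit cannot be raised (for example because of a restrictive hard limit), the function returns non-zero instead of starting a daemon that could not leave a core file.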
Comment 3 Tomasz Jaszowski 2007-03-01 01:00:52 EST
Unfortunately, so far I haven't been able to reproduce this problem in a
controlled environment, but I'm still trying.

Could you give me more information about the clurgmgrd parameters? What exactly
does the -d option mean? Are there any other options available?

Comment 4 Lon Hohberger 2007-03-05 10:54:53 EST
-d turns on debugging and disables the internal self-monitoring "watchdog" daemon.

There aren't any other helpful options in this case.
Comment 5 Michael Hagmann 2007-03-15 08:47:21 EDT
Hi,

This is really a serious bug. We now have at least 5 production clusters that
have hit it. The problem is that when we enable debug mode, it doesn't happen
again!

Also, the problem occurs more often during or after disabling a service with
"clusvcadm -d ...".

That has happened at least three times.

Mike
Comment 6 Lon Hohberger 2007-03-15 10:47:48 EDT
Ok, I at *least* need the version of rgmanager you guys are using.
Comment 7 Michael Hagmann 2007-03-15 12:09:35 EDT
Hi

No problem, I can send you any information you like.

This is a normal RHEL4 AS U4 / Cluster / GFS installation.

root@lilr622b:~# rpm -qa | grep rgmanager
rgmanager-1.9.54-1

Mike

Comment 8 Tomasz Jaszowski 2007-03-15 12:22:44 EDT
Hi, we have exactly the same system version, and we still aren't able to
reproduce the problem in a controlled environment (it only dies when no one is
watching :) )

We are testing rgmanager-1.9.54-3.228823test now; if the problem occurs, I'll
pass on some info.

Tomek
Comment 9 Tomasz Jaszowski 2007-03-15 12:27:04 EDT
(In reply to comment #8)
> we are testing rgmanager-1.9.54-3.228823test now, if problem occurs I'll pass
> some info

P.S. In production we have rgmanager-1.9.54-1, and rgmanager-1.9.54-3.228823test
on an identically configured test environment.
Comment 10 Michael Hagmann 2007-03-15 12:39:50 EDT
Hi, I will try extensive testing with clusvcadm -d ??? and clusvcadm -e ???;
maybe that will trigger a crash.

for i in $(seq 1 1000); do
    clusvcadm -d "$SERVICENAME"   # disable the service
    sleep 10
    clusvcadm -e "$SERVICENAME"   # re-enable the service
    sleep 10
done

regards mike
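
A variant of the disable/enable loop that stops as soon as the daemon disappears might look like the sketch below. It is mainly useful when rgmanager runs with -d, since otherwise the watchdog reboots the node. The function and variable names are hypothetical; it assumes pgrep is installed and that the daemon's process name is clurgmgrd, as in the log message.

```shell
#!/bin/sh
# Sketch only. DAEMON_NAME, TOGGLE_CMD and PAUSE are illustrative knobs.
DAEMON_NAME="${DAEMON_NAME:-clurgmgrd}"   # assumed process name
TOGGLE_CMD="${TOGGLE_CMD:-clusvcadm}"     # command used to disable/enable
PAUSE="${PAUSE:-10}"                      # seconds between operations

# True while a process with exactly this name exists.
daemon_alive() {
    pgrep -x "$DAEMON_NAME" >/dev/null 2>&1
}

# Usage: stress_service SERVICE ROUNDS
stress_service() {
    service=$1
    rounds=$2
    i=1
    while [ "$i" -le "$rounds" ]; do
        "$TOGGLE_CMD" -d "$service"
        sleep "$PAUSE"
        "$TOGGLE_CMD" -e "$service"
        sleep "$PAUSE"
        if ! daemon_alive; then
            echo "daemon gone after iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $rounds iterations"
}
```

For example, `stress_service "$SERVICENAME" 1000` would mirror the loop above while recording the iteration at which the daemon died.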
Comment 11 Lon Hohberger 2007-03-16 13:29:40 EDT
The watchdog fires when the daemon crashes - ostensibly due to a segmentation
fault.  The 3.228823test package fixes two bugs that, if left unfixed, could
cause this behavior.
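
As a rough illustration of the idea (this is not rgmanager's implementation), a parent process can wait on a child and react when the child is killed by a signal. The function names are hypothetical, and the sketch only prints a message where a real watchdog would reboot the node.

```shell
#!/bin/sh
# Illustrative watchdog sketch; not taken from rgmanager.

# By shell convention, an exit status above 128 means the process was
# terminated by signal (status - 128), e.g. 139 = 128 + 11 (SIGSEGV).
died_from_signal() {
    [ "$1" -gt 128 ]
}

# Usage: watch_daemon COMMAND [ARGS...]
watch_daemon() {
    "$@" &                 # start the monitored command in the background
    child=$!
    wait "$child"          # block until it exits and collect its status
    status=$?
    if died_from_signal "$status"; then
        echo "Watchdog: Daemon died (signal $((status - 128))), would reboot"
        return 1
    fi
    echo "Daemon exited with status $status"
    return 0
}
```

A clean exit is reported as such, while a crash takes the "would reboot" branch. That matches the symptom in this report, where the reboot was the only visible sign of the crash.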

Tomasz - in response to C#9 - the configuration between .54-0 and .54-3.228823
packages should be identical; there are no backwards-compatibility issues there.

Michael - in response to C#10 - that will eventually cause a crash due to a
race on .54, but it is fixed in .54-3.228823 and the update 5 beta packages.

Could I get everyone who is on this bugzilla who is not already using
.54-3.228823 to use it?  I have a very strong suspicion that the crash causing
this symptom is fixed already.  All of the fixes in .54-3.228823 are included in
update 5.

If you need a different architecture than what is on my people page, let me know.

http://people.redhat.com/lhh/packages.html
Comment 12 Michael Hagmann 2007-03-28 15:03:26 EDT
Hi Lon

is it possible to get a Hotfix package from Red Hat Support for .54-3.228823 ?

thx mike
Comment 14 Lon Hohberger 2007-05-16 11:47:26 EDT
Sorry for the late response; this is fixed in 4.5
Comment 15 Michael Hagmann 2007-05-16 12:35:22 EDT
Hi Lon

no problem, we also received a Hotfix package from Support.

thx mike
