Description of problem: Sometimes, while stopping cluster services, I receive the following message and the node reboots:
clurgmgrd[6673]: <crit> Watchdog: Daemon died, rebooting...
Actual results: reboot
Expected results: I would like more information about why it happened. I can't find any information about this watchdog.
That happens if rgmanager crashes. There are a few crash fixes coming in the next update. It could theoretically also happen if rgmanager is still running and cman tells it to die (e.g. running cman_tool leave force ...).
Try starting rgmanager with:
    ulimit -c unlimited
    clurgmgrd -d
That will disable the watchdog. Additionally, if rgmanager crashes on the way down, it will produce a core file. To debug this I need the core file, the version of rgmanager you're using, and the processor architecture. (The core file is the most important.)
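The details requested above (package version and processor architecture) can be gathered in one go. This is only a sketch: the function name collect_rgmanager_info is made up for illustration and is not part of rgmanager, and it assumes an RPM-based system such as RHEL. Where rpm is missing, it simply says so.

```shell
#!/bin/sh
# Hypothetical helper (not part of rgmanager): print the details
# requested for debugging a clurgmgrd crash.
collect_rgmanager_info() {
    # Package version as reported by rpm, if rpm is available.
    if command -v rpm >/dev/null 2>&1; then
        echo "version: $(rpm -q rgmanager 2>&1)"
    else
        echo "version: unknown (rpm not found)"
    fi
    # Processor architecture of the running kernel.
    echo "arch: $(uname -m)"
    # Core file size limit of the current shell; it reads "unlimited"
    # after running `ulimit -c unlimited` in the same shell.
    echo "core limit: $(ulimit -c)"
}

collect_rgmanager_info
```

Run it in the same shell that will start clurgmgrd, so that the reported core limit is the one the daemon inherits.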
Unfortunately, so far I haven't been able to reproduce this problem in a controlled environment, but I'm still trying. Could you give me more information about the clurgmgrd parameters? What exactly does the -d option do? Are there any other options available?
-d turns on debugging and disables the internal self-monitoring "watchdog" daemon. There aren't any other helpful options in this case.
Hi, this is really a serious bug. We now have at least 5 production clusters that have hit it. The problem is that when we enable debug mode, it doesn't happen again! Also, the problem occurs more often during or after disabling a service with "clusvcadm -d ...". That has happened at least three times. Mike
Ok, I at *least* need the version of rgmanager you guys are using.
Hi, no problem, I can send you any information you like. This is a standard RHEL4 AS U4 / Cluster / GFS installation.
root@lilr622b:~# rpm -qa | grep rgmanager
rgmanager-1.9.54-1
Mike
Hi, we have exactly the same system version, and we still haven't been able to reproduce the problem in a controlled environment (it just dies when no one is watching :) ). We are testing rgmanager-1.9.54-3.228823test now; if the problem occurs, I'll pass on some info. Tomek
(In reply to comment #8)
> we are testing rgmanager-1.9.54-3.228823test now, if problem occurs I'll pass
> some info
PS. In production we have rgmanager-1.9.54-1, and rgmanager-1.9.54-3.228823test in an identically configured test environment.
Hi, I am running an extensive test with clusvcadm -d ??? and clusvcadm -e ???; maybe it will work and I will get a crash.
    for i in $(seq 1 1000)
    do
        clusvcadm -d "$SERVICENAME"
        sleep 10
        clusvcadm -e "$SERVICENAME"
        sleep 10
    done
Regards, Mike
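A variant of the loop above that stops as soon as the daemon is gone might look like the sketch below. It is only useful on a test node where rgmanager was started with clurgmgrd -d, since otherwise the watchdog reboots the node before the check can report anything. The function and variable names (stress_service, daemon_running, CLUSVCADM, DAEMON, DELAY) are illustrative assumptions, and it relies on pgrep being installed.

```shell
#!/bin/sh
# Sketch: disable/enable a service repeatedly and stop once clurgmgrd
# is no longer running. Names below are illustrative, not part of rgmanager.
CLUSVCADM=${CLUSVCADM:-clusvcadm}   # override (e.g. with "true") for a dry run
DAEMON=${DAEMON:-clurgmgrd}         # process name to look for
DELAY=${DELAY:-10}                  # seconds to wait after each command

# Succeeds if a process with exactly this name exists.
daemon_running() {
    pgrep -x "$DAEMON" >/dev/null 2>&1
}

# stress_service SERVICE [CYCLES]
stress_service() {
    service=$1
    cycles=${2:-1000}
    i=1
    while [ "$i" -le "$cycles" ]; do
        "$CLUSVCADM" -d "$service"
        sleep "$DELAY"
        "$CLUSVCADM" -e "$service"
        sleep "$DELAY"
        # Stop at the first cycle after which the daemon is missing.
        if ! daemon_running; then
            echo "$DAEMON not running after cycle $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $cycles cycles"
}
```

After sourcing the file, call it as stress_service "$SERVICENAME" 1000. The cycle number it prints narrows down when to look for the core file.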
The watchdog fires when the daemon crashes, presumably due to a segmentation fault. The 3.228823test package contains fixes for two bugs that, if left unfixed, could cause this behavior.
Tomasz, in response to comment #9: the configuration for the .54-0 and .54-3.228823 packages should be identical; there are no backwards-compatibility issues there.
Michael, in response to comment #10: that will eventually cause a crash due to a race on .54, but this is fixed in .54-3.228823 and the update 5 beta packages.
Could everyone on this bugzilla who is not already using .54-3.228823 switch to it? I have a very strong suspicion that the crash causing this symptom is already fixed. All of the fixes in .54-3.228823 are included in update 5. If you need a different architecture than what is on my people page, let me know.
http://people.redhat.com/lhh/packages.html
Hi Lon, is it possible to get a hotfix package for .54-3.228823 from Red Hat Support? Thanks, Mike
Sorry for the late response; this is fixed in 4.5
Hi Lon, no problem; we also received a hotfix package from Support. Thanks, Mike