Description of problem:
There's a race between the time the heartbeat thread terminates and when the main thread exits, allowing the watchdog to fire when it shouldn't, causing a reboot.

Version-Release number of selected component (if applicable):
all releases

How reproducible:
Sometimes (rarely?)

Steps to Reproduce:
1. Start clumembd
2. Stop clumembd
3. Wait 15 seconds
4. Goto 1

Actual results:
Machine can reboot in some cases.

Expected results:
No reboot.

Additional info:
Patch from: Satoru Satoh
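For reference, the watchdog interaction at the heart of this race is the Linux "magic close" protocol. The sketch below is a minimal standalone illustration, assuming the softdog driver and /dev/watchdog; it is not taken from the clumanager sources.

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Open the watchdog device; from this point the timer is armed. */
    int wd = open("/dev/watchdog", O_WRONLY);
    if (wd < 0)
        return 1;

    /* Heartbeat: any write (clumembd uses a single '\0') pets the timer. */
    write(wd, "\0", 1);

    /*
     * Clean shutdown: the driver disarms only if the magic character 'V'
     * is the last thing written before close().  If another keepalive
     * write sneaks in between the 'V' and the close(), softdog logs
     * "WDT device closed unexpectedly. WDT will not stop!" and the timer
     * later reboots the machine - which is exactly this race.
     */
    write(wd, "V", 1);
    close(wd);
    return 0;
}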
Created attachment 152180 [details] Patch
Fixed in CVS.
*** Bug 236336 has been marked as a duplicate of this bug. ***
1.2.35 build does not have the fix for this.
The 1.2.35.2 build does have the fix for this; however, because it is nearly impossible to reproduce on demand, I only did regression testing to ensure that the patch does not break anything.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0165.html
Created attachment 295147 [details] /var/log/messages of problematic node
Hi,

My customer hit this problem again with clumanager-1.2.35.2-1, which should include the above fix.

The customer's environment is as follows.

* Packages
kernel-smp-2.4.21-27.EL (i686)
clumanager-1.2.35.2-1
4-node cluster consisting of the hosts SUC21/SUC22/SUC23/SUC24

* How it happened
1. Stop clumanager on SUC21/SUC22/SUC23/SUC24
2. Shut down SUC21/SUC22/SUC23/SUC24
-> The "WDT device closed unexpectedly" error occurred while shutting down SUC23, and the node was forcefully rebooted.

The cause of the WDT problem is thought to be as follows (I think).

* Normal operation when stopping clumanager
1. The heartbeat thread writes '\0' to the watchdog
2. The heartbeat thread terminates
3. The watchdog is closing
4. 'V' is written to the watchdog
5. The watchdog is closed

* WDT condition when stopping clumanager
1. The heartbeat thread writes '\0' to the watchdog
2. The watchdog is closing
3. 'V' is written to the watchdog
4. The heartbeat thread writes '\0' to the watchdog
-> "WDT device closed unexpectedly"!

So the cause is that the heartbeat thread writes '\0' to the watchdog while it is closing. Therefore, the fix in clumanager-1.2.35.2-1 is to ensure that the heartbeat thread has terminated before the watchdog starts closing (I believe).

However, it happened as follows (I'll attach /var/log/messages as an attachment):

---------------------------------------------------------------------------------
Dec 29 18:10:12 SUC23 clumanager: [7318]: <notice> Shutting down Red Hat Cluster Manager...
Dec 29 18:10:12 SUC23 clumanager: cluquorumd -TERM succeeded
Dec 29 18:10:24 SUC23 cluquorumd[3919]: <emerg> Quorum lost during shutdown!
Dec 29 18:10:24 SUC23 clulockd[3938]: <crit> Quorum Event: LOST QUORUM
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <err> Unable to obtain cluster lock: No locks available
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <err> Failed to acquire cluster lock during shutdown
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM1
Dec 29 18:10:24 SUC23 clusvcmgrd: [7377]: <notice> service notice: Stopping service TERM1 ...
Dec 29 18:10:24 SUC23 clusvcmgrd: [7377]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U240 stop'
Dec 29 18:10:24 SUC23 su(pam_unix)[7408]: session opened for user root by (uid=0)
Dec 29 18:10:25 SUC23 su(pam_unix)[7408]: session closed for user root
Dec 29 18:10:25 SUC23 clusvcmgrd: [7377]: <notice> service notice: Stopped service TERM1 ...
Dec 29 18:10:25 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM2
Dec 29 18:10:25 SUC23 clusvcmgrd: [7637]: <notice> service notice: Stopping service TERM2 ...
Dec 29 18:10:25 SUC23 clusvcmgrd: [7637]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U270 stop'
Dec 29 18:10:25 SUC23 su(pam_unix)[7668]: session opened for user root by (uid=0)
Dec 29 18:10:25 SUC23 su(pam_unix)[7668]: session closed for user root
Dec 29 18:10:26 SUC23 clusvcmgrd: [7637]: <notice> service notice: Stopped service TERM2 ...
Dec 29 18:10:26 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM3
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Stopping service TERM3 ...
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U300 stop'
Dec 29 18:10:26 SUC23 su(pam_unix)[7930]: session opened for user root by (uid=0)
Dec 29 18:10:26 SUC23 su(pam_unix)[7930]: session closed for user root
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Stopped service TERM3 ...
Dec 29 18:10:26 SUC23 cluquorumd[3919]: <err> __msg_send: Incomplete write to 6. Error: Connection reset by peer
Dec 29 18:10:26 SUC23 clumanager: cluquorumd shutdown failed
Dec 29 18:32:17 SUC23 gdm[5183]: gdm_child_action: Master halting...
Dec 29 18:32:17 SUC23 shutdown: shutting down for system halt
Dec 29 18:32:17 SUC23 init: Switching to runlevel: 0
Dec 29 18:32:18 SUC23 clumanager: [11359]: <notice> Shutting down Red Hat Cluster Manager...
Dec 29 18:32:18 SUC23 clumanager: clumembd -KILL succeeded
Dec 29 18:32:18 SUC23 clumanager: [11359]: <notice> Red Hat Cluster Manager is stopped.
Dec 29 18:32:18 SUC23 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!
Dec 29 18:32:19 SUC23 clumanager: [7318]: <notice> Red Hat Cluster Manager is stopped.
---------------------------------------------------------------------------------

Could you have a look?

Thanks,
Masahiro
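If the above analysis is right, the fix boils down to an ordering guarantee: the heartbeat thread must have fully exited before the main thread performs the magic close. Below is a minimal sketch of that ordering, assuming pthreads; the thread and variable names are hypothetical and not taken from clumanager.

#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>

static volatile int shutting_down = 0;
static int wd_fd = -1;

/* Heartbeat thread: pets the watchdog until asked to stop. */
static void *heartbeat(void *arg)
{
    (void)arg;
    while (!shutting_down) {
        write(wd_fd, "\0", 1);
        sleep(2);
    }
    return NULL;
}

int main(void)
{
    pthread_t hb;

    wd_fd = open("/dev/watchdog", O_WRONLY);
    if (wd_fd < 0)
        return 1;
    pthread_create(&hb, NULL, heartbeat, NULL);

    sleep(10);              /* stand-in for the daemon's normal lifetime */

    shutting_down = 1;
    pthread_join(hb, NULL); /* ensure no further '\0' can follow the 'V' below */

    /* Magic close: only safe once the heartbeat thread is gone. */
    write(wd_fd, "V", 1);
    close(wd_fd);
    return 0;
}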
The symptoms of what you experienced are not related to this bugzilla (and don't actually look like a bug):

Jan 26 20:16:47 SUC23 clumembd[3914]: <notice> Member M_SUC22 UP
Jan 26 20:16:56 SUC23 cluquorumd[3899]: <notice> IPv4 TB @ IP_TARGET Online
Jan 26 20:17:27 SUC23 clumembd[3914]: <notice> Member M_SUC24 UP
Jan 26 20:17:27 SUC23 cluquorumd[3899]: <notice> Quorum Formed; Starting Service Manager
...
Jan 26 20:18:30 SUC23 clumembd[3914]: <notice> Member M_SUC21 UP
Jan 26 20:18:33 SUC23 ntpd[3307]: kernel time discipline status change 1
Jan 26 20:25:15 SUC23 cluquorumd[3899]: <notice> IPv4 TB @ IP_TARGET Offline
...
Jan 26 20:26:33 SUC23 clumanager: [9906]: <notice> Shutting down Red Hat Cluster Manager...
Jan 26 20:26:33 SUC23 clumanager: cluquorumd -TERM succeeded
Jan 26 20:26:45 SUC23 cluquorumd[3899]: <emerg> Quorum lost during shutdown!
Jan 26 20:26:45 SUC23 clulockd[3918]: <crit> Quorum Event: LOST QUORUM
...
Dec 29 18:32:18 SUC23 clumanager: clumembd -KILL succeeded
...
Dec 29 18:32:18 SUC23 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!

Clumembd got killed with SIGKILL. It probably never got the 'V' out.

If you think this patch is missing from 1.2.35.2, please try 1.2.37.
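For what it's worth, this is also why a SIGKILL is fatal here: a daemon can only run its magic-close path for catchable signals. A rough sketch, assuming the same /dev/watchdog setup; the handler and names are illustrative only, not clumembd's actual code.

#include <signal.h>
#include <fcntl.h>
#include <unistd.h>

static int wd_fd = -1;

/* SIGTERM handler: disarm the watchdog, then exit.  write(), close()
 * and _exit() are async-signal-safe, so this is legal in a handler. */
static void on_term(int sig)
{
    (void)sig;
    write(wd_fd, "V", 1);
    close(wd_fd);
    _exit(0);
}

int main(void)
{
    wd_fd = open("/dev/watchdog", O_WRONLY);
    if (wd_fd < 0)
        return 1;

    signal(SIGTERM, on_term);
    /* SIGKILL cannot be caught or ignored: a "killall -9" ends the
     * process with the fd still open and no 'V' written, so the kernel
     * keeps the timer running and eventually reboots the node. */

    for (;;) {
        write(wd_fd, "\0", 1);  /* keepalive */
        sleep(2);
    }
}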
> Clumembd got killed with SIGKILL.

Yes, I've noticed that. That is the main difference between the previous case and this one. I'm still not sure in what situation clumembd gets killed with SIGKILL, but the system is always rebooted when I issue "kill -SIGKILL <pid of clumembd>" on my test machine, which looks normal.

If so, why was SIGKILL issued to clumembd at that point, and is it possible to prevent it from happening? The customer was just shutting down clumanager. Stopping clumanager shouldn't trigger the WDT and forcefully reboot the system.

The other point is that the following log indicates that they failed to stop clumanager; therefore, clumembd was still running by the time the system was shut down. If they had successfully stopped clumanager, clumanager obviously wouldn't have issued SIGTERM to clumembd when the system was shutting down.

Dec 29 18:10:26 SUC23 cluquorumd[3919]: <err> __msg_send: Incomplete write to 6. Error: Connection reset by peer
Dec 29 18:10:26 SUC23 clumanager: cluquorumd shutdown failed
...
Dec 29 18:32:17 SUC23 shutdown: shutting down for system halt

> If you think this patch is missing from 1.2.35.2

The patch is included in 1.2.35.2.

> please try 1.2.37

I'll ask them to try this with the DEBUG log level enabled, but the cause of the problem sounds clear at this stage (clumembd was killed with SIGKILL).

Yes, if the cause is that clumembd got killed with SIGKILL, which triggered the WDT and forced a reboot, it's a different problem from this bugzilla. So I'll escalate the issue as a separate case in IT.

I'd appreciate your comment.

Thanks,
Masahiro
> If so, why was SIGKILL issued to clumembd at that point, and is it possible to
> prevent it from happening?

OK, I figured out that SIGKILL is issued to clumembd by clumanager's init script when shutting down the system.

From the clumanager init script:

stop_cluster()
..
        killproc $QUORUMD -TERM
..
        for dmn in $MEMBD $LOCKD $SVCMGRD $QUORUMD; do
                killall -9 $dmn &> /dev/null
                rm -f /var/run/$dmn.pid
        done

I believe that "killall -9 $dmn &> /dev/null" comes into play when "killproc $QUORUMD -TERM" didn't work. In this case, the system can be forcefully rebooted with the "WDT device closed unexpectedly" error in /var/log/messages.

I set up a 4-node cluster and stopped clumanager on node1/node2/node3, so node4 lost quorum. When I issued "killall -9 clumembd clulockd clusvcmgrd cluquorumd >& /dev/null" on node4, the log messages look slightly different from the customer's, but the node was forcefully rebooted with the "WDT device closed unexpectedly" message as follows.

---------------------------------------------------------------------------------
Feb 20 13:19:08 node4 cluquorumd[1646]: <emerg> Quorum lost!
Feb 20 13:19:08 node4 clulockd[1652]: <crit> Quorum Event: LOST QUORUM
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <err> Unable to obtain cluster lock: No locks available
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <crit> Halting services due to loss of quorum
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <warning> Forcing stop of service BATCH
Feb 20 13:19:08 node4 clusvcmgrd[2573]: <info> Using normal priority
Feb 20 13:19:08 node4 clusvcmgrd: [2573]: <notice> service notice: Stopping service BATCH ...
Feb 20 13:19:08 node4 clusvcmgrd: [2573]: <notice> service notice: Running user script '/scripts/BATCH stop'
Feb 20 13:19:09 node4 clusvcmgrd: [2573]: <info> service info: unmounting /dev/sdb1 (/mnt/batch)
Feb 20 13:19:09 node4 clusvcmgrd: [2573]: <notice> service notice: Stopped service BATCH ...
Feb 20 13:19:09 node4 clusvcmgrd[1657]: <warning> Forcing stop of service DB
Feb 20 13:19:09 node4 clusvcmgrd[2729]: <info> Using normal priority
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <notice> service notice: Stopping service DB ...
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <notice> service notice: Running user script '/scripts/DB stop'
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <info> service info: unmounting /dev/sdb2 (/mnt/db)
Feb 20 13:19:10 node4 clusvcmgrd: [2729]: <notice> service notice: Stopped service DB ...
Feb 20 13:19:10 node4 clusvcmgrd[1657]: <warning> Forcing stop of service REAL
Feb 20 13:19:10 node4 clusvcmgrd[2885]: <info> Using normal priority
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Stopping service REAL ...
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Running user script '/scripts/REAL stop'
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <info> service info: unmounting /dev/sdb3 (/mnt/real)
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Stopped service REAL ...
Feb 20 13:19:10 node4 clusvcmgrd[1657]: <info> Exiting
Feb 20 13:19:10 node4 cluquorumd[1646]: <warning> Service Manager has exited uncleanly - Not shutting down
Feb 20 13:30:55 node4 clulockd[1652]: <crit> Lost connection to cluquorumd - Restarting
Feb 20 13:30:55 node4 clulockd[3042]: <crit> Couldn't subscribe to quorum events: Connection refused
Feb 20 13:30:55 node4 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!
Feb 20 13:31:57 node4 syslogd 1.4.1: restart.
Feb 20 13:31:57 node4 syslog: syslogd startup succeeded
---------------------------------------------------------------------------------

Yes, this problem looks like a different issue from this bugzilla, but some modification might be needed to fix it, since I don't think the customer can prevent the problem by configuration.

If my assumption above is correct, a temporary workaround might be "chkconfig clumanager off". That way, stop_cluster() in the clumanager init script doesn't get executed when shutting down the system. This might not be a problem for them, because they start/stop clumanager from their own script when booting/shutting down the system.

I'll escalate this case to SEG through IT.

Thanks,
Masahiro
The above assumption is probably incorrect. I can see some code that issues SIGTERM to clumembd, and if the init script issued SIGTERM to clumembd, I wonder whether it would be logged to /var/log/messages as "clumanager: clumembd -KILL succeeded".

I'll do more research on my own and escalate it to SEG as a separate case.

Sorry about taking your time.

Thanks,
Masahiro