Bug 235933 - race condition on watchdog stop

| Field | Value |
| --- | --- |
| Product | [Retired] Red Hat Cluster Suite |
| Component | clumanager |
| Version | 3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Lon Hohberger <lhh> |
| Assignee | Lon Hohberger <lhh> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cluster-maint, ssato |
| Fixed In Version | RHBA-2007-0165 |
| Doc Type | Bug Fix |
| Last Closed | 2007-06-14 12:54:01 UTC |
Description
Lon Hohberger
2007-04-10 21:02:53 UTC

Created attachment 152180 [details]
Patch

Fixes in CVS.

*** Bug 236336 has been marked as a duplicate of this bug. ***

The 1.2.35 build does not have the fix for this; 1.2.35.2 does. However, because the race is nearly impossible to reproduce on demand, I only did regression testing to ensure that the patch does not break anything.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0165.html

Created attachment 295147 [details]
/var/log/messages of problematic node
Hi,

My customer hit this problem again with clumanager-1.2.35.2-1, which should include the above fix. The customer's environment is as follows.

* Packages:
  * kernel-smp-2.4.21-27.EL (i686)
  * clumanager-1.2.35.2-1
* A 4-node cluster consisting of the hosts SUC21/SUC22/SUC23/SUC24

* How it happened:
  1. Stop clumanager on SUC21/SUC22/SUC23/SUC24.
  2. Shut down SUC21/SUC22/SUC23/SUC24.
  -> The WDT "will not stop" error occurred while shutting down SUC23, and the node was forcefully rebooted.

I think the cause of the WDT problem is as follows.

* Normal operation when stopping clumanager:
  1. The heartbeat thread writes '\0' to the watchdog.
  2. The heartbeat thread terminates.
  3. The watchdog starts closing.
  4. 'V' is written to the watchdog.
  5. The watchdog is closed.

* WDT condition when stopping clumanager:
  1. The heartbeat thread writes '\0' to the watchdog.
  2. The watchdog starts closing.
  3. 'V' is written to the watchdog.
  4. The heartbeat thread writes '\0' to the watchdog -> "WDT device closed unexpectedly"!

So the cause is that the heartbeat thread writes '\0' to the watchdog while it is closing. Therefore, I believe the fix in clumanager-1.2.35.2-1 is to ensure that the heartbeat thread has terminated before the watchdog starts closing.

However, it happened as follows (I'll attach /var/log/messages as an attachment):

---------------------------------------------------------------------------------
Dec 29 18:10:12 SUC23 clumanager: [7318]: <notice> Shutting down Red Hat Cluster Manager...
Dec 29 18:10:12 SUC23 clumanager: cluquorumd -TERM succeeded
Dec 29 18:10:24 SUC23 cluquorumd[3919]: <emerg> Quorum lost during shutdown!
Dec 29 18:10:24 SUC23 clulockd[3938]: <crit> Quorum Event: LOST QUORUM
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <err> Unable to obtain cluster lock: No locks available
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <err> Failed to acquire cluster lock during shutdown
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM1
Dec 29 18:10:24 SUC23 clusvcmgrd: [7377]: <notice> service notice: Stopping service TERM1 ...
Dec 29 18:10:24 SUC23 clusvcmgrd: [7377]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U240 stop'
Dec 29 18:10:24 SUC23 su(pam_unix)[7408]: session opened for user root by (uid=0)
Dec 29 18:10:25 SUC23 su(pam_unix)[7408]: session closed for user root
Dec 29 18:10:25 SUC23 clusvcmgrd: [7377]: <notice> service notice: Stopped service TERM1 ...
Dec 29 18:10:25 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM2
Dec 29 18:10:25 SUC23 clusvcmgrd: [7637]: <notice> service notice: Stopping service TERM2 ...
Dec 29 18:10:25 SUC23 clusvcmgrd: [7637]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U270 stop'
Dec 29 18:10:25 SUC23 su(pam_unix)[7668]: session opened for user root by (uid=0)
Dec 29 18:10:25 SUC23 su(pam_unix)[7668]: session closed for user root
Dec 29 18:10:26 SUC23 clusvcmgrd: [7637]: <notice> service notice: Stopped service TERM2 ...
Dec 29 18:10:26 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM3
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Stopping service TERM3 ...
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U300 stop'
Dec 29 18:10:26 SUC23 su(pam_unix)[7930]: session opened for user root by (uid=0)
Dec 29 18:10:26 SUC23 su(pam_unix)[7930]: session closed for user root
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Stopped service TERM3 ...
Dec 29 18:10:26 SUC23 cluquorumd[3919]: <err> __msg_send: Incomplete write to 6. Error: Connection reset by peer
Dec 29 18:10:26 SUC23 clumanager: cluquorumd shutdown failed
Dec 29 18:32:17 SUC23 gdm[5183]: gdm_child_action: Master halting...
Dec 29 18:32:17 SUC23 shutdown: shutting down for system halt
Dec 29 18:32:17 SUC23 init: Switching to runlevel: 0
Dec 29 18:32:18 SUC23 clumanager: [11359]: <notice> Shutting down Red Hat Cluster Manager...
Dec 29 18:32:18 SUC23 clumanager: clumembd -KILL succeeded
Dec 29 18:32:18 SUC23 clumanager: [11359]: <notice> Red Hat Cluster Manager is stopped.
Dec 29 18:32:18 SUC23 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!
Dec 29 18:32:19 SUC23 clumanager: [7318]: <notice> Red Hat Cluster Manager is stopped.
---------------------------------------------------------------------------------

Could you have a look?

Thanks,
Masahiro

The symptoms of what you experienced are not related to this bugzilla (and don't actually look like a bug):

Jan 26 20:16:47 SUC23 clumembd[3914]: <notice> Member M_SUC22 UP
Jan 26 20:16:56 SUC23 cluquorumd[3899]: <notice> IPv4 TB @ IP_TARGET Online
Jan 26 20:17:27 SUC23 clumembd[3914]: <notice> Member M_SUC24 UP
Jan 26 20:17:27 SUC23 cluquorumd[3899]: <notice> Quorum Formed; Starting Service Manager
...
Jan 26 20:18:30 SUC23 clumembd[3914]: <notice> Member M_SUC21 UP
Jan 26 20:18:33 SUC23 ntpd[3307]: kernel time discipline status change 1
Jan 26 20:25:15 SUC23 cluquorumd[3899]: <notice> IPv4 TB @ IP_TARGET Offline
...
Jan 26 20:26:33 SUC23 clumanager: [9906]: <notice> Shutting down Red Hat Cluster Manager...
Jan 26 20:26:33 SUC23 clumanager: cluquorumd -TERM succeeded
Jan 26 20:26:45 SUC23 cluquorumd[3899]: <emerg> Quorum lost during shutdown!
Jan 26 20:26:45 SUC23 clulockd[3918]: <crit> Quorum Event: LOST QUORUM
...
Dec 29 18:32:18 SUC23 clumanager: clumembd -KILL succeeded
...
Dec 29 18:32:18 SUC23 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!

Clumembd got killed with SIGKILL. It probably never got the 'V' out. If you think this patch is missing from 1.2.35.2, please try 1.2.37.

>Clumembd got killed with SIGKILL.

Yes, I've noticed that. That is the main difference between the previous case and this one. I'm still not sure in what situation clumembd gets killed with SIGKILL, but the system is always rebooted when I issue "kill -KILL <pid of clumembd>" on my test machine, which looks normal.
If so, why was SIGKILL issued to clumembd at that timing? And is it possible to prevent it from happening? The customer was just shutting down clumanager. The operation of stopping clumanager shouldn't trigger the WDT and forcefully reboot the system.

The other point is that the following log indicates that they failed to stop clumanager. Therefore, clumembd continued to run until the system was shut down. If they had successfully stopped clumanager, clumanager obviously wouldn't have issued SIGTERM to clumembd when the system was shutting down.

Dec 29 18:10:26 SUC23 cluquorumd[3919]: <err> __msg_send: Incomplete write to 6. Error: Connection reset by peer
Dec 29 18:10:26 SUC23 clumanager: cluquorumd shutdown failed
...
Dec 29 18:32:17 SUC23 shutdown: shutting down for system halt

>If you think this patch is missing from 1.2.35.2

The patch is included in 1.2.35.2.

>please try 1.2.37

I'll ask this with the DEBUG log level enabled. But the cause of the problem sounds clear at this stage (clumembd was killed with SIGKILL). Yes, if the cause is that clumembd got killed with SIGKILL, which triggered the WDT and a forced reboot, it's a different problem from this bugzilla. So I'll escalate the issue as a separate case in IT. I'd appreciate your comment.

Thanks,
Masahiro

>If so, why was SIGKILL issued to clumembd at that timing? And is it possible to
>prevent it from happening?
Ok, I figured out that the SIGKILL to clumembd is issued by clumanager's init script when shutting down the system. From the clumanager init script:
stop_cluster()
    ..
    killproc $QUORUMD -TERM
    ..
    for dmn in $MEMBD $LOCKD $SVCMGRD $QUORUMD; do
        killall -9 $dmn &> /dev/null
        rm -f /var/run/$dmn.pid
    done
I believe the "killall -9 $dmn &> /dev/null" path takes effect when "killproc $QUORUMD -TERM" didn't work.
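One way to narrow that fallback, sketched below, is to SIGKILL a daemon only if it is still running after the graceful -TERM pass, so a daemon that already exited cleanly (after writing its magic close) is never SIGKILLed. This is a hedged illustration, not the shipped init script; the function name `force_stop_if_running` and its message are invented for the example.

```shell
# Sketch only: fall back to SIGKILL solely when the daemon is still
# running, instead of running "killall -9" unconditionally.
force_stop_if_running() {
    dmn=$1
    if pidof "$dmn" > /dev/null 2>&1; then
        # Daemon survived the graceful -TERM pass: last resort.
        killall -9 "$dmn" 2> /dev/null
    else
        echo "$dmn already stopped, skipping SIGKILL"
    fi
    rm -f "/var/run/$dmn.pid"
}

# Example with a daemon name that is certainly not running:
force_stop_if_running fake-clumembd
```

With this guard, a node where all cluster daemons stopped cleanly would skip the SIGKILL entirely, avoiding the "-KILL succeeded" path seen in the customer's log.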
In this case, the system can be forcefully rebooted with the "WDT will not stop!" error in /var/log/messages. I set up a 4-node cluster and stopped clumanager on node1/node2/node3, so node4 lost quorum. When I issued "killall -9 clumembd clulockd clusvcmgrd cluquorumd >& /dev/null" on node4, the log messages look slightly different from the customer's, but the node was forcefully rebooted with the "WDT will not stop!" message as follows.
---------------------------------------------------------------------------------
Feb 20 13:19:08 node4 cluquorumd[1646]: <emerg> Quorum lost!
Feb 20 13:19:08 node4 clulockd[1652]: <crit> Quorum Event: LOST QUORUM
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <err> Unable to obtain cluster lock: No locks available
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <crit> Halting services due to loss of quorum
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <warning> Forcing stop of service BATCH
Feb 20 13:19:08 node4 clusvcmgrd[2573]: <info> Using normal priority
Feb 20 13:19:08 node4 clusvcmgrd: [2573]: <notice> service notice: Stopping service BATCH ...
Feb 20 13:19:08 node4 clusvcmgrd: [2573]: <notice> service notice: Running user script '/scripts/BATCH stop'
Feb 20 13:19:09 node4 clusvcmgrd: [2573]: <info> service info: unmounting /dev/sdb1 (/mnt/batch)
Feb 20 13:19:09 node4 clusvcmgrd: [2573]: <notice> service notice: Stopped service BATCH ...
Feb 20 13:19:09 node4 clusvcmgrd[1657]: <warning> Forcing stop of service DB
Feb 20 13:19:09 node4 clusvcmgrd[2729]: <info> Using normal priority
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <notice> service notice: Stopping service DB ...
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <notice> service notice: Running user script '/scripts/DB stop'
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <info> service info: unmounting /dev/sdb2 (/mnt/db)
Feb 20 13:19:10 node4 clusvcmgrd: [2729]: <notice> service notice: Stopped service DB ...
Feb 20 13:19:10 node4 clusvcmgrd[1657]: <warning> Forcing stop of service REAL
Feb 20 13:19:10 node4 clusvcmgrd[2885]: <info> Using normal priority
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Stopping service REAL ...
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Running user script '/scripts/REAL stop'
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <info> service info: unmounting /dev/sdb3 (/mnt/real)
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Stopped service REAL ...
Feb 20 13:19:10 node4 clusvcmgrd[1657]: <info> Exiting
Feb 20 13:19:10 node4 cluquorumd[1646]: <warning> Service Manager has exited uncleanly - Not shutting down
Feb 20 13:30:55 node4 clulockd[1652]: <crit> Lost connection to cluquorumd - Restarting
Feb 20 13:30:55 node4 clulockd[3042]: <crit> Couldn't subscribe to quorum events: Connection refused
Feb 20 13:30:55 node4 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!
Feb 20 13:31:57 node4 syslogd 1.4.1: restart.
Feb 20 13:31:57 node4 syslog: syslogd startup succeeded
---------------------------------------------------------------------------------
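The SOFTDOG message in these logs comes from the softdog driver's "magic close" rule: the watchdog is disarmed on close only when the final write before the close was the character 'V' (assuming the driver is not built with nowayout). The write ordering at the heart of this bug can be sketched in shell; `/tmp/fake_watchdog` is a stand-in regular file, not the real `/dev/watchdog` device, so this is safe to run anywhere.

```shell
# Sketch of the magic-close ordering the daemon must preserve.
# /tmp/fake_watchdog stands in for /dev/watchdog for illustration.
wdt=/tmp/fake_watchdog

exec 3> "$wdt"    # open once, as the daemon holds the device open
printf '\0' >&3   # heartbeat ping ('\0') keeps the timer from firing
printf 'V' >&3    # magic-close byte: must be the LAST write...
exec 3>&-         # ...so that this close disarms the watchdog

# In the race, the heartbeat thread sneaks one more '\0' in between
# the 'V' and the close; the driver then logs "WDT device closed
# unexpectedly. WDT will not stop!" and the timer later fires.
```

This is why SIGKILLing clumembd mid-shutdown is fatal: the process dies with the device still open, the kernel closes the file descriptor without a trailing 'V', and the softdog timer keeps counting down to a reboot.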
Yes, this problem looks like a different issue from this bugzilla. But some modification might be needed to fix it, since I don't think the customer can prevent the problem through configuration.

If my assumption above is correct, a temporary fix might be "chkconfig clumanager off". That way, stop_cluster() in the clumanager init script doesn't get executed when shutting down the system. This might not be a problem for them, because they start/stop clumanager from their own script when booting/shutting down the system.
I'll escalate this case to SEG through IT.
Thanks,
Masahiro
The above assumption is probably incorrect. I can see some code which issues SIGTERM to clumembd. And if the init script had issued SIGTERM to clumembd, I wonder whether it would be logged to /var/log/messages as "clumanager: clumembd -KILL succeeded". I'll do more research on my own and escalate it to SEG as a separate case. Sorry for taking your time.

Thanks,
Masahiro