Bug 235933 - race condition on watchdog stop

| Field | Value |
| --- | --- |
| Product | [Retired] Red Hat Cluster Suite |
| Component | clumanager |
| Version | 3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Lon Hohberger <lhh> |
| Assignee | Lon Hohberger <lhh> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cluster-maint, ssato |
| Fixed In Version | RHBA-2007-0165 |
| Doc Type | Bug Fix |
| Last Closed | 2007-06-14 12:54:01 UTC |
Description
Lon Hohberger
2007-04-10 21:02:53 UTC

Created attachment 152180 [details]
Patch

Fixes in CVS.

*** Bug 236336 has been marked as a duplicate of this bug. ***

The 1.2.35 build does not have the fix for this; 1.2.35.2 does. However, because the race is nearly impossible to reproduce on demand, I only did regression testing to ensure that the patch does not break anything.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0165.html

Created attachment 295147 [details]
/var/log/messages of problematic node
Hi,

My customer hit this problem again with clumanager-1.2.35.2-1, which should include the above fix. The customer's environment is as follows.

* Packages:
  * kernel-smp-2.4.21-27.EL (i686)
  * clumanager-1.2.35.2-1
* A 4-node cluster consisting of the hosts SUC21/SUC22/SUC23/SUC24

* How it happened:
  1. Stop clumanager on SUC21/SUC22/SUC23/SUC24.
  2. Shut down SUC21/SUC22/SUC23/SUC24.
  -> The WDT "will not stop" error occurred while shutting down SUC23, and the node was forcefully rebooted.

I think the cause of the WDT problem is as follows.

* Normal operation when stopping clumanager:
  1. The heartbeat thread writes '\0' to the watchdog.
  2. The heartbeat thread terminates.
  3. The watchdog starts closing.
  4. 'V' is written to the watchdog.
  5. The watchdog is closed.

* WDT condition when stopping clumanager:
  1. The heartbeat thread writes '\0' to the watchdog.
  2. The watchdog starts closing.
  3. 'V' is written to the watchdog.
  4. The heartbeat thread writes '\0' to the watchdog -> "WDT device closed unexpectedly"!

So the cause is that the heartbeat thread writes '\0' to the watchdog while it is closing. Therefore, I believe the fix in clumanager-1.2.35.2-1 is to ensure that the heartbeat thread has terminated before the watchdog starts closing.

However, it happened as follows (I'll attach /var/log/messages as an attachment):

---------------------------------------------------------------------------------
Dec 29 18:10:12 SUC23 clumanager: [7318]: <notice> Shutting down Red Hat Cluster Manager...
Dec 29 18:10:12 SUC23 clumanager: cluquorumd -TERM succeeded
Dec 29 18:10:24 SUC23 cluquorumd[3919]: <emerg> Quorum lost during shutdown!
Dec 29 18:10:24 SUC23 clulockd[3938]: <crit> Quorum Event: LOST QUORUM
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <err> Unable to obtain cluster lock: No locks available
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <err> Failed to acquire cluster lock during shutdown
Dec 29 18:10:24 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM1
Dec 29 18:10:24 SUC23 clusvcmgrd: [7377]: <notice> service notice: Stopping service TERM1 ...
Dec 29 18:10:24 SUC23 clusvcmgrd: [7377]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U240 stop'
Dec 29 18:10:24 SUC23 su(pam_unix)[7408]: session opened for user root by (uid=0)
Dec 29 18:10:25 SUC23 su(pam_unix)[7408]: session closed for user root
Dec 29 18:10:25 SUC23 clusvcmgrd: [7377]: <notice> service notice: Stopped service TERM1 ...
Dec 29 18:10:25 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM2
Dec 29 18:10:25 SUC23 clusvcmgrd: [7637]: <notice> service notice: Stopping service TERM2 ...
Dec 29 18:10:25 SUC23 clusvcmgrd: [7637]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U270 stop'
Dec 29 18:10:25 SUC23 su(pam_unix)[7668]: session opened for user root by (uid=0)
Dec 29 18:10:25 SUC23 su(pam_unix)[7668]: session closed for user root
Dec 29 18:10:26 SUC23 clusvcmgrd: [7637]: <notice> service notice: Stopped service TERM2 ...
Dec 29 18:10:26 SUC23 clusvcmgrd[3942]: <warning> Forcing stop of service TERM3
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Stopping service TERM3 ...
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Running user script '/galc_es/ope/shl/B62U300 stop'
Dec 29 18:10:26 SUC23 su(pam_unix)[7930]: session opened for user root by (uid=0)
Dec 29 18:10:26 SUC23 su(pam_unix)[7930]: session closed for user root
Dec 29 18:10:26 SUC23 clusvcmgrd: [7899]: <notice> service notice: Stopped service TERM3 ...
Dec 29 18:10:26 SUC23 cluquorumd[3919]: <err> __msg_send: Incomplete write to 6. Error: Connection reset by peer
Dec 29 18:10:26 SUC23 clumanager: cluquorumd shutdown failed
Dec 29 18:32:17 SUC23 gdm[5183]: gdm_child_action: Master halting...
Dec 29 18:32:17 SUC23 shutdown: shutting down for system halt
Dec 29 18:32:17 SUC23 init: Switching to runlevel: 0
Dec 29 18:32:18 SUC23 clumanager: [11359]: <notice> Shutting down Red Hat Cluster Manager...
Dec 29 18:32:18 SUC23 clumanager: clumembd -KILL succeeded
Dec 29 18:32:18 SUC23 clumanager: [11359]: <notice> Red Hat Cluster Manager is stopped.
Dec 29 18:32:18 SUC23 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!
Dec 29 18:32:19 SUC23 clumanager: [7318]: <notice> Red Hat Cluster Manager is stopped.
---------------------------------------------------------------------------------

Could you have a look?

Thanks,
Masahiro

The symptoms of what you experienced are not related to this bugzilla (and don't actually look like a bug):

Jan 26 20:16:47 SUC23 clumembd[3914]: <notice> Member M_SUC22 UP
Jan 26 20:16:56 SUC23 cluquorumd[3899]: <notice> IPv4 TB @ IP_TARGET Online
Jan 26 20:17:27 SUC23 clumembd[3914]: <notice> Member M_SUC24 UP
Jan 26 20:17:27 SUC23 cluquorumd[3899]: <notice> Quorum Formed; Starting Service Manager
...
Jan 26 20:18:30 SUC23 clumembd[3914]: <notice> Member M_SUC21 UP
Jan 26 20:18:33 SUC23 ntpd[3307]: kernel time discipline status change 1
Jan 26 20:25:15 SUC23 cluquorumd[3899]: <notice> IPv4 TB @ IP_TARGET Offline
...
Jan 26 20:26:33 SUC23 clumanager: [9906]: <notice> Shutting down Red Hat Cluster Manager...
Jan 26 20:26:33 SUC23 clumanager: cluquorumd -TERM succeeded
Jan 26 20:26:45 SUC23 cluquorumd[3899]: <emerg> Quorum lost during shutdown!
Jan 26 20:26:45 SUC23 clulockd[3918]: <crit> Quorum Event: LOST QUORUM
...
Dec 29 18:32:18 SUC23 clumanager: clumembd -KILL succeeded
...
Dec 29 18:32:18 SUC23 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!

Clumembd got killed with SIGKILL. It probably never got the 'V' out. If you think this patch is missing from 1.2.35.2, please try 1.2.37.

>Clumembd got killed with SIGKILL.

Yes, I've noticed that. That is the main difference between the previous case and this one. I'm still not sure in what situation clumembd gets killed with SIGKILL, but the system is always rebooted when I issue "kill -KILL <pid of clumembd>" on my test machine, which looks normal.
If so, why was SIGKILL issued to clumembd at that timing? And is it possible to prevent it from happening? The customer was just shutting down clumanager. The operation of stopping clumanager shouldn't trigger the WDT and forcefully reboot the system.

The other point is that the following log indicates that they failed to stop clumanager. Therefore, clumembd continued to run until the system was shut down. If they had successfully stopped clumanager, clumanager obviously wouldn't have issued SIGTERM to clumembd when the system was shutting down.

Dec 29 18:10:26 SUC23 cluquorumd[3919]: <err> __msg_send: Incomplete write to 6. Error: Connection reset by peer
Dec 29 18:10:26 SUC23 clumanager: cluquorumd shutdown failed
...
Dec 29 18:32:17 SUC23 shutdown: shutting down for system halt

>If you think this patch is missing from 1.2.35.2

The patch is included in 1.2.35.2.

>please try 1.2.37

I'll ask this with the DEBUG log level enabled. But the cause of the problem sounds clear at this stage (clumembd was killed with SIGKILL). Yes, if the cause is that clumembd got killed with SIGKILL, which triggered the WDT and a forced reboot, it's a different problem from this bugzilla. So I'll escalate the issue as a separate case in IT. I'd appreciate your comment.

Thanks,
Masahiro

>If so, why was SIGKILL issued to clumembd at that timing? And is it possible to
>prevent it from happening?
Ok, I figured out that the SIGKILL to clumembd is issued by clumanager's init script when shutting down the system. From the clumanager init script:
stop_cluster()
    ..
    killproc $QUORUMD -TERM
    ..
    for dmn in $MEMBD $LOCKD $SVCMGRD $QUORUMD; do
        killall -9 $dmn &> /dev/null
        rm -f /var/run/$dmn.pid
    done
I believe the "killall -9 $dmn &> /dev/null" path takes effect when "killproc $QUORUMD -TERM" didn't work.
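One way to narrow that fallback, sketched below, is to SIGKILL a daemon only if it is still running after the graceful -TERM pass, so a daemon that already exited cleanly (after writing its magic close) is never SIGKILLed. This is a hedged illustration, not the shipped init script; the function name `force_stop_if_running` and its message are invented for the example.

```shell
# Sketch only: fall back to SIGKILL solely when the daemon is still
# running, instead of running "killall -9" unconditionally.
force_stop_if_running() {
    dmn=$1
    if pidof "$dmn" > /dev/null 2>&1; then
        # Daemon survived the graceful -TERM pass: last resort.
        killall -9 "$dmn" 2> /dev/null
    else
        echo "$dmn already stopped, skipping SIGKILL"
    fi
    rm -f "/var/run/$dmn.pid"
}

# Example with a daemon name that is certainly not running:
force_stop_if_running fake-clumembd
```

With this guard, a node where all cluster daemons stopped cleanly would skip the SIGKILL entirely, avoiding the "-KILL succeeded" path seen in the customer's log.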
In this case, the system can be forcefully rebooted with the "WDT will not stop!" error in /var/log/messages. I set up a 4-node cluster and stopped clumanager on node1/node2/node3, so node4 lost quorum. When I issued "killall -9 clumembd clulockd clusvcmgrd cluquorumd >& /dev/null" on node4, the log messages look slightly different from the customer's, but the node was forcefully rebooted with the "WDT will not stop!" message as follows.
---------------------------------------------------------------------------------
Feb 20 13:19:08 node4 cluquorumd[1646]: <emerg> Quorum lost!
Feb 20 13:19:08 node4 clulockd[1652]: <crit> Quorum Event: LOST QUORUM
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <err> Unable to obtain cluster lock: No locks available
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <crit> Halting services due to loss of quorum
Feb 20 13:19:08 node4 clusvcmgrd[1657]: <warning> Forcing stop of service BATCH
Feb 20 13:19:08 node4 clusvcmgrd[2573]: <info> Using normal priority
Feb 20 13:19:08 node4 clusvcmgrd: [2573]: <notice> service notice: Stopping service BATCH ...
Feb 20 13:19:08 node4 clusvcmgrd: [2573]: <notice> service notice: Running user script '/scripts/BATCH stop'
Feb 20 13:19:09 node4 clusvcmgrd: [2573]: <info> service info: unmounting /dev/sdb1 (/mnt/batch)
Feb 20 13:19:09 node4 clusvcmgrd: [2573]: <notice> service notice: Stopped service BATCH ...
Feb 20 13:19:09 node4 clusvcmgrd[1657]: <warning> Forcing stop of service DB
Feb 20 13:19:09 node4 clusvcmgrd[2729]: <info> Using normal priority
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <notice> service notice: Stopping service DB ...
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <notice> service notice: Running user script '/scripts/DB stop'
Feb 20 13:19:09 node4 clusvcmgrd: [2729]: <info> service info: unmounting /dev/sdb2 (/mnt/db)
Feb 20 13:19:10 node4 clusvcmgrd: [2729]: <notice> service notice: Stopped service DB ...
Feb 20 13:19:10 node4 clusvcmgrd[1657]: <warning> Forcing stop of service REAL
Feb 20 13:19:10 node4 clusvcmgrd[2885]: <info> Using normal priority
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Stopping service REAL ...
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Running user script '/scripts/REAL stop'
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <info> service info: unmounting /dev/sdb3 (/mnt/real)
Feb 20 13:19:10 node4 clusvcmgrd: [2885]: <notice> service notice: Stopped service REAL ...
Feb 20 13:19:10 node4 clusvcmgrd[1657]: <info> Exiting
Feb 20 13:19:10 node4 cluquorumd[1646]: <warning> Service Manager has exited uncleanly - Not shutting down
Feb 20 13:30:55 node4 clulockd[1652]: <crit> Lost connection to cluquorumd - Restarting
Feb 20 13:30:55 node4 clulockd[3042]: <crit> Couldn't subscribe to quorum events: Connection refused
Feb 20 13:30:55 node4 kernel: SOFTDOG: WDT device closed unexpectedly. WDT will not stop!
Feb 20 13:31:57 node4 syslogd 1.4.1: restart.
Feb 20 13:31:57 node4 syslog: syslogd startup succeeded
---------------------------------------------------------------------------------
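The SOFTDOG message in these logs comes from the softdog driver's "magic close" rule: the watchdog is disarmed on close only when the final write before the close was the character 'V' (assuming the driver is not built with nowayout). The write ordering at the heart of this bug can be sketched in shell; `/tmp/fake_watchdog` is a stand-in regular file, not the real `/dev/watchdog` device, so this is safe to run anywhere.

```shell
# Sketch of the magic-close ordering the daemon must preserve.
# /tmp/fake_watchdog stands in for /dev/watchdog for illustration.
wdt=/tmp/fake_watchdog

exec 3> "$wdt"    # open once, as the daemon holds the device open
printf '\0' >&3   # heartbeat ping ('\0') keeps the timer from firing
printf 'V' >&3    # magic-close byte: must be the LAST write...
exec 3>&-         # ...so that this close disarms the watchdog

# In the race, the heartbeat thread sneaks one more '\0' in between
# the 'V' and the close; the driver then logs "WDT device closed
# unexpectedly. WDT will not stop!" and the timer later fires.
```

This is why SIGKILLing clumembd mid-shutdown is fatal: the process dies with the device still open, the kernel closes the file descriptor without a trailing 'V', and the softdog timer keeps counting down to a reboot.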
Yes, this problem looks like a different issue from this bugzilla. But some modification might be needed to fix it, since I don't think the customer can prevent the problem through configuration.

If my assumption above is correct, a temporary fix might be "chkconfig clumanager off". That way, stop_cluster() in the clumanager init script doesn't get executed when shutting down the system. This might not be a problem for them, because they start/stop clumanager from their own script when booting/shutting down the system.
I'll escalate this case to SEG through IT.
Thanks,
Masahiro
The above assumption is probably incorrect. I can see some code which issues SIGTERM to clumembd. And if the init script had issued SIGTERM to clumembd, I wonder whether it would be logged to /var/log/messages as "clumanager: clumembd -KILL succeeded". I'll do more research on my own and escalate it to SEG as a separate case. Sorry for taking your time.

Thanks,
Masahiro