Bug 318571 - A failed fenced cannot be recovered
Summary: A failed fenced cannot be recovered
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Ryan O'Hara
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-10-04 15:41 UTC by Mark Hlawatschek
Modified: 2018-10-19 22:08 UTC (History)
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 21:52:47 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2009:0189 0 normal SHIPPED_LIVE cman bug-fix and enhancement update 2009-01-20 16:05:55 UTC

Description Mark Hlawatschek 2007-10-04 15:41:10 UTC
Description of problem:

When the fenced process on one node in the cluster dies, fenced and the
fence domain cannot be recovered.


Version-Release number of selected component (if applicable):
 fenced 2.0.70

How reproducible:
Always


Steps to Reproduce:
1. start one cluster node (node1)
2. kill fenced on this node 
3. start another node (node2) and try to join fence domain.

Actual results:
- The second cluster node cannot join the fence domain.
- The first cluster node can neither leave nor rejoin the fence domain.
Restarting fenced and running fence_tool join on node1 does not help;
cman_tool services shows JOIN_STOP_WAIT for the fence service on node1.

Expected results:
In my opinion a crashed fenced should be recoverable (as in CS4) through a
manual or automatic restart. Also, if node1 leaves the cluster, the fence
domain should be automatically recovered.

Additional info:

Comment 1 Debbie Johnson 2008-06-02 22:27:45 UTC
This type of issue is also being seen in IT#169334. An easy way to duplicate it is:
Steps to Reproduce:

1. Set up a GFS filesystem on some nodes.

2. Kill the fenced process (killall fenced) and take down the cman network
interface (ifdown eth1) on one node.

3. The node will crash and print the following messages when some process
attempts to access the GFS filesystem (cd /gfs):

EIP: 0060:[<f8ce2611>] CPU: 0
EIP is at do_dlm_unlock+0xa9/0xbf [lock_dlm]
EFLAGS: 00010246    Not tainted  (2.6.9-67.ELsmp)
EAX: 00000001 EBX: f621d280 ECX: f0417f28 EDX: f8ce72d3
ESI: ffffffea EDI: 00000000 EBP: f8d81000 DS: 007b ES: 007b
CR0: 8005003b CR2: 00c56a33 CR3: 37f81e00 CR4: 000006f0
[<f8ce28b2>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
[<f8df1ede>] gfs_lm_unlock+0x2c/0x42 [gfs]
[<f8de7d63>] gfs_glock_drop_th+0xf3/0x12d [gfs]
[<f8de7257>] rq_demote+0x7f/0x98 [gfs]
[<f8de730e>] run_queue+0x5a/0xc1 [gfs]
[<f8de7431>] unlock_on_glock+0x1f/0x28 [gfs]
[<f8de93e9>] gfs_reclaim_glock+0xc3/0x13c [gfs]
[<f8ddbe05>] gfs_glockd+0x39/0xde [gfs]
[<c011e7b9>] default_wake_function+0x0/0xc
[<c02d8522>] ret_from_fork+0x6/0x14
[<c011e7b9>] default_wake_function+0x0/0xc
[<f8ddbdcc>] gfs_glockd+0x0/0xde [gfs]
[<c01041f5>] kernel_thread_helper+0x5/0xb


Comment 3 RHEL Program Management 2008-07-14 16:41:02 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Ryan O'Hara 2008-08-26 17:01:16 UTC
The correct solution for this problem is not to make fenced recoverable, but to shut down any node on which a cluster service dies unexpectedly.

For example, if fenced dies unexpectedly without properly leaving the fence domain, we have no way of knowing whether fenced was killed intentionally, segfaulted, etc. In either case, the node needs to be removed from the cluster and fully restarted. Killing fenced (or having fenced die unexpectedly) leaves the node in an invalid state.

We can detect the case where fenced dies unexpectedly. If groupd detects that fenced has died without properly leaving the fence domain first (fence_tool leave), then we should call cman_shutdown(..) to force the node to leave the cluster. This will cause the failed node to be fenced, which is correct behavior. Note that once fenced has been killed it is not possible to correctly remove that node from the fence domain and continue.

We can extend this behavior to other groupd services, including gfs_controld and dlm_controld. So in the end, if any of these daemons (fenced, gfs_controld or dlm_controld) dies unexpectedly, we will detect this in groupd and shut down/fence that node.
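
For illustration only, here is a minimal sketch (not the actual groupd patch) of the policy described above, written against the RHEL5 libcman API (cman_init(), cman_leave_cluster(), cman_finish() and CMAN_LEAVEFLAG_DOWN from libcman.h). The function on_daemon_exit() and its arguments are hypothetical placeholders, not real groupd symbols:

#include <stdio.h>
#include <libcman.h>

/* Hypothetical hook, called when groupd notices that the connection from
 * one of its managed daemons (fenced, dlm_controld, gfs_controld) has
 * closed.  "left_cleanly" is nonzero if the daemon did a proper leave
 * (e.g. fence_tool leave) first; "shutdown_enabled" is nonzero if the
 * forced-shutdown behavior is enabled. */
void on_daemon_exit(const char *daemon, int left_cleanly, int shutdown_enabled)
{
        cman_handle_t ch;

        if (left_cleanly || !shutdown_enabled)
                return;

        fprintf(stderr, "groupd: %s died unexpectedly, leaving the cluster\n",
                daemon);

        ch = cman_init(NULL);
        if (!ch)
                return;

        /* Force this node out of the cluster; the remaining nodes will see
         * it go down and fence it, which is the only safe outcome once a
         * fence-domain daemon has died. */
        cman_leave_cluster(ch, CMAN_LEAVEFLAG_DOWN);
        cman_finish(ch);
}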

Patch is complete and being tested before commit.

Comment 8 Ryan O'Hara 2008-09-09 15:49:53 UTC
Fixed in RHEL5.

A failed daemon (fenced, dlm_controld, gfs_controld) will cause the node to be removed from the cluster. If any of these daemons dies unexpectedly, groupd will detect the failure and remove the node from the cluster via cman_leave_cluster(). The node must be removed because a failed daemon leaves the cluster in an invalid state, and removing the node is the only safe thing to do. A message is logged reporting which daemon appears to have died unexpectedly, and then the node is forced to leave the cluster.

This behavior is enabled by default, but it can be turned on/off with the -s option to groupd (groupd -s [0|1], where 0 will disable this "shutdown" behavior and 1 will enable it).
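
For reference, a hypothetical sketch of how a -s switch like the one described above could be mapped to a flag (such as the shutdown_enabled parameter in the earlier sketch) using standard getopt(); this is not the actual groupd option parsing:

#include <stdlib.h>
#include <unistd.h>

static int shutdown_enabled = 1;   /* default: shut the node down */

static void parse_args(int argc, char **argv)
{
        int opt;

        /* -s 0 disables the shutdown behavior, -s 1 enables it */
        while ((opt = getopt(argc, argv, "s:")) != -1) {
                if (opt == 's')
                        shutdown_enabled = atoi(optarg);
        }
}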

Comment 11 errata-xmlrpc 2009-01-20 21:52:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html

