Bug 318571 - A failed fenced cannot be recovered
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Ryan O'Hara
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2007-10-04 11:41 EDT by Mark Hlawatschek
Modified: 2010-10-22 15:13 EDT (History)
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 16:52:47 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Mark Hlawatschek 2007-10-04 11:41:10 EDT
Description of problem:

When the fenced process on one node in the cluster dies, neither fenced nor the
fence domain can be recovered.


Version-Release number of selected component (if applicable):
 fenced 2.0.70

How reproducible:
Always


Steps to Reproduce:
1. start one cluster node (node1)
2. kill fenced on this node 
3. start another node (node2) and try to join the fence domain.

Actual results:
- The second cluster node cannot join the fence domain.
- The first cluster node can neither leave nor rejoin the fence domain.
Restarting fenced and running fence_tool join on node1 does not help;
cman_tool services shows JOIN_STOP_WAIT for the fence service on node1.

Expected results:
In my opinion a crashed fenced should be recoverable (as in CS4) through a
manual or automatic restart. Also, if node1 leaves the cluster, the fence
domain should be recovered automatically.

Additional info:
Comment 1 Debbie Johnson 2008-06-02 18:27:45 EDT
This type of issue is also being seen in IT#169334.  An easy way to duplicate it:
Steps to Reproduce:

1. Set up a GFS filesystem on some nodes.

2. Kill the fenced process (killall fenced) and take down the cman network
interface (ifdown eth1) on one node.

3. The node will crash and print the following messages when some process
attempts to access the GFS filesystem (cd /gfs):

EIP: 0060:[<f8ce2611>] CPU: 0
EIP is at do_dlm_unlock+0xa9/0xbf [lock_dlm]
EFLAGS: 00010246    Not tainted  (2.6.9-67.ELsmp)
EAX: 00000001 EBX: f621d280 ECX: f0417f28 EDX: f8ce72d3
ESI: ffffffea EDI: 00000000 EBP: f8d81000 DS: 007b ES: 007b
CR0: 8005003b CR2: 00c56a33 CR3: 37f81e00 CR4: 000006f0
[<f8ce28b2>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
[<f8df1ede>] gfs_lm_unlock+0x2c/0x42 [gfs]
[<f8de7d63>] gfs_glock_drop_th+0xf3/0x12d [gfs]
[<f8de7257>] rq_demote+0x7f/0x98 [gfs]
[<f8de730e>] run_queue+0x5a/0xc1 [gfs]
[<f8de7431>] unlock_on_glock+0x1f/0x28 [gfs]
[<f8de93e9>] gfs_reclaim_glock+0xc3/0x13c [gfs]
[<f8ddbe05>] gfs_glockd+0x39/0xde [gfs]
[<c011e7b9>] default_wake_function+0x0/0xc
[<c02d8522>] ret_from_fork+0x6/0x14
[<c011e7b9>] default_wake_function+0x0/0xc
[<f8ddbdcc>] gfs_glockd+0x0/0xde [gfs]
[<c01041f5>] kernel_thread_helper+0x5/0xb
Comment 3 RHEL Product and Program Management 2008-07-14 12:41:02 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 5 Ryan O'Hara 2008-08-26 13:01:16 EDT
The correct solution for this problem is not to make fenced recoverable, but to shut down any node that has a cluster service die unexpectedly.

For example, if fenced dies unexpectedly without properly leaving the fence domain, we have no way of knowing whether fenced was killed intentionally, segfaulted, etc. In either case, the node needs to be removed from the cluster and fully restarted. Killing fenced (or having fenced die unexpectedly) is an invalid case.

We can detect the case where fenced dies unexpectedly. If groupd detects that fenced has died without properly leaving the fence domain first (fence_tool leave), then we should call cman_shutdown(..) to force the node to leave the cluster. This will cause the failed node to be fenced, which is correct behavior. Note that once fenced has been killed it is not possible to correctly remove that node from the fence domain and continue.

We can extend this behavior to other groupd services, including gfs_controld and dlm_controld. So in the end, if any of these daemons (fenced, gfs_controld or dlm_controld) dies unexpectedly, we will detect this case in groupd and shut down/fence that node.
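
For illustration only, a minimal sketch of the detection side described above.
This is not the actual groupd patch; the per-daemon structure and function
names are hypothetical, and only the general approach (a dropped daemon
connection with no prior clean leave) is taken from the comment above.

#include <poll.h>

/* Hypothetical per-daemon state groupd might keep for each managed
 * daemon (fenced, dlm_controld, gfs_controld). */
struct daemon_conn {
	int fd;           /* groupd's connection to the daemon */
	int did_leave;    /* set once the daemon left its group cleanly */
	const char *name; /* "fenced", "dlm_controld" or "gfs_controld" */
};

/* Returns 1 if the daemon's connection dropped without a prior clean
 * leave (e.g. fenced was killed instead of running "fence_tool leave"). */
static int daemon_died_unexpectedly(struct daemon_conn *dc)
{
	struct pollfd pfd = { .fd = dc->fd, .events = POLLIN };

	if (poll(&pfd, 1, 0) < 0)
		return 0;

	if ((pfd.revents & (POLLHUP | POLLERR)) && !dc->did_leave)
		return 1;

	return 0;
}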

Patch is complete and being tested before commit.
Comment 8 Ryan O'Hara 2008-09-09 11:49:53 EDT
Fixed in RHEL5.

A failed daemon (fenced, dlm_controld, gfs_controld) will cause the node to be removed from the cluster. If any of these daemons dies unexpectedly, groupd will detect the failure and remove the node from the cluster via cman_leave_cluster(). The reason we must remove the node from the cluster is that a failed daemon leaves the cluster in an invalid state; the only safe thing to do is remove the node. A message is logged reporting which daemon appears to have died unexpectedly, then the node is forced to leave the cluster.

This behavior is enabled by default, but it can be turned on/off with the -s option to groupd (groupd -s [0|1], where 0 will disable this "shutdown" behavior and 1 will enable it).
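
For illustration only, a rough sketch of the shutdown side of the fix under the
same assumptions. It uses the libcman calls named in the comments
(cman_admin_init(), cman_leave_cluster(), cman_finish()) but is not the
committed patch; the leave-reason value of 0 and the way the -s toggle is
stored are assumptions.

#include <syslog.h>
#include <libcman.h>

/* Mirrors the "groupd -s [0|1]" toggle described above; on by default. */
static int shutdown_on_daemon_death = 1;

/* Log which daemon appears to have died, then force this node out of
 * the cluster so the remaining nodes fence it. */
static void daemon_death_shutdown(const char *daemon_name)
{
	cman_handle_t ch;

	if (!shutdown_on_daemon_death)
		return;

	syslog(LOG_ERR, "%s died unexpectedly, forcing node to leave the cluster",
	       daemon_name);

	ch = cman_admin_init(NULL);
	if (!ch)
		return;

	/* 0 is a generic leave reason here; the real code may pass a
	 * specific CMAN_LEAVEFLAG_* value. */
	cman_leave_cluster(ch, 0);
	cman_finish(ch);
}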
Comment 11 errata-xmlrpc 2009-01-20 16:52:47 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html
