Description of problem:
When the fenced process on one node in the cluster dies, neither fenced nor the
fence domain can be recovered.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. start one cluster node (node1)
2. kill fenced on this node
3. start another node (node2) and try to join fence domain.
-The second cluster node cannot join the fence domain.
-The first cluster node cannot leave nor rejoin the fence domain.
Restarting fenced and running fence_tool join on node1 does not help either;
cman_tool services shows JOIN_STOP_WAIT for the fence
service on node1.
In my opinion a crashed fenced should be recoverable (as in CS4) through a
manual or automatic restart. Also, if node1 leaves the cluster, the fence
domain should be automatically recovered.
This type of issue is also being seen in IT#169334. An easy way to duplicate it:
Steps to Reproduce:
1. Set up a GFS filesystem on some nodes.
2. Kill the fenced process (killall fenced) and take down the cman network
interface (ifdown eth1) on one node.
3. That node will crash and print the following messages when some process
attempts to access the GFS filesystem (cd /gfs):
EIP: 0060:[<f8ce2611>] CPU: 0
EIP is at do_dlm_unlock+0xa9/0xbf [lock_dlm]
EFLAGS: 00010246 Not tainted (2.6.9-67.ELsmp)
EAX: 00000001 EBX: f621d280 ECX: f0417f28 EDX: f8ce72d3
ESI: ffffffea EDI: 00000000 EBP: f8d81000 DS: 007b ES: 007b
CR0: 8005003b CR2: 00c56a33 CR3: 37f81e00 CR4: 000006f0
[<f8ce28b2>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
[<f8df1ede>] gfs_lm_unlock+0x2c/0x42 [gfs]
[<f8de7d63>] gfs_glock_drop_th+0xf3/0x12d [gfs]
[<f8de7257>] rq_demote+0x7f/0x98 [gfs]
[<f8de730e>] run_queue+0x5a/0xc1 [gfs]
[<f8de7431>] unlock_on_glock+0x1f/0x28 [gfs]
[<f8de93e9>] gfs_reclaim_glock+0xc3/0x13c [gfs]
[<f8ddbe05>] gfs_glockd+0x39/0xde [gfs]
[<f8ddbdcc>] gfs_glockd+0x0/0xde [gfs]
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
The correct solution for this problem is not to make fenced recoverable, but to shut down any node on which a cluster service dies unexpectedly.
For example, if fenced dies unexpectedly without properly leaving the fence domain, we have no way of knowing whether fenced was killed intentionally, segfaulted, etc. In either case, the node needs to be removed from the cluster and fully restarted. Killing fenced (or having fenced die unexpectedly) is an invalid case.
We can detect the case where fenced dies unexpectedly. If groupd detects that fenced has died without properly leaving the fence domain first (fence_tool leave), then we should call cman_shutdown(..) to force the node to leave the cluster. This will cause the failed node to be fenced, which is correct behavior. Note that once fenced has been killed it is not possible to correctly remove that node from the fence domain and continue.
We can extend this behavior to other groupd services, including gfs_controld and dlm_controld. So in the end, if any of these daemons (fenced, gfs_controld or dlm_controld) die unexpectedly, we will detect this case in groupd and shutdown/fence that node.
Patch is complete and being tested before commit.
Fixed in RHEL5.
A failed daemon (fenced, dlm_controld, gfs_controld) will cause the node to be removed from the cluster. If any of these daemons dies unexpectedly, groupd will detect the failure and remove the node from the cluster via cman_leave_cluster(). The node must be removed because a failed daemon leaves the cluster in an invalid state, and the only safe thing to do is remove the node from the cluster. A message is logged reporting which daemon appears to have died unexpectedly, then the node is forced to leave the cluster.
This behavior is enabled by default, but it can be turned on/off with the -s option to groupd (groupd -s [0|1], where 0 will disable this "shutdown" behavior and 1 will enable it).
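For reference, the option described above would be used like this (a usage sketch based on the description in this report, not verified against the shipped init scripts):

```
# default: shutdown behavior enabled
groupd -s 1

# disable the shutdown-on-daemon-death behavior
groupd -s 0
```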
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.