Description of problem:
When the fenced process on one node in the cluster dies, neither fenced nor the
fence domain can be recovered.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. start one cluster node (node1)
2. kill fenced on this node
3. start another node (node2) and try to join fence domain.
-The second cluster node cannot join the fence domain.
-The first cluster node cannot leave nor rejoin the fence domain.
Restarting fenced and running fence_tool join on node1 does not help either;
cman_tool services shows JOIN_STOP_WAIT for the fence
service on node1.
In my opinion a crashed fenced should be recoverable (as in CS4) through a
manual or automatic restart. Also, if node1 leaves the cluster, the fence
domain should be automatically recovered.
This type of issue is also being seen in IT#169334. An easy way to duplicate it:
Steps to Reproduce:
1. Set up a GFS filesystem on some nodes.
2. Kill the fenced process (killall fenced) and take down the cman network
interface (ifdown eth1) on one node.
3. That node will crash and print the following messages when some process
attempts to access the GFS filesystem (cd /gfs):
EIP: 0060:[<f8ce2611>] CPU: 0
EIP is at do_dlm_unlock+0xa9/0xbf [lock_dlm]
EFLAGS: 00010246 Not tainted (2.6.9-67.ELsmp)
EAX: 00000001 EBX: f621d280 ECX: f0417f28 EDX: f8ce72d3
ESI: ffffffea EDI: 00000000 EBP: f8d81000 DS: 007b ES: 007b
CR0: 8005003b CR2: 00c56a33 CR3: 37f81e00 CR4: 000006f0
[<f8ce28b2>] lm_dlm_unlock+0x14/0x1c [lock_dlm]
[<f8df1ede>] gfs_lm_unlock+0x2c/0x42 [gfs]
[<f8de7d63>] gfs_glock_drop_th+0xf3/0x12d [gfs]
[<f8de7257>] rq_demote+0x7f/0x98 [gfs]
[<f8de730e>] run_queue+0x5a/0xc1 [gfs]
[<f8de7431>] unlock_on_glock+0x1f/0x28 [gfs]
[<f8de93e9>] gfs_reclaim_glock+0xc3/0x13c [gfs]
[<f8ddbe05>] gfs_glockd+0x39/0xde [gfs]
[<f8ddbdcc>] gfs_glockd+0x0/0xde [gfs]
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
The correct solution for this problem is not to make fenced recoverable, but to shut down any node on which a cluster service dies unexpectedly.
For example, if fenced dies unexpectedly without properly leaving the fence domain, we have no way of knowing whether fenced was killed intentionally, segfaulted, etc. In either case, the node needs to be removed from the cluster and fully restarted. Killing fenced (or having fenced die unexpectedly) is an invalid case.
We can detect the case where fenced dies unexpectedly. If groupd detects that fenced has died without properly leaving the fence domain first (fence_tool leave), then we should call cman_shutdown(..) to force the node to leave the cluster. This will cause the failed node to be fenced, which is correct behavior. Note that once fenced has been killed it is not possible to correctly remove that node from the fence domain and continue.
We can extend this behavior to other groupd services, including gfs_controld and dlm_controld. So in the end, if any of these daemons (fenced, gfs_controld or dlm_controld) die unexpectedly, we will detect this case in groupd and shutdown/fence that node.
Patch is complete and being tested before commit.
Fixed in RHEL5.
A failed daemon (fenced, dlm_controld, gfs_controld) will cause the node to be removed from the cluster. If any of these daemons dies unexpectedly, groupd will detect the failure and remove the node from the cluster via cman_leave_cluster(). The node must be removed because a failed daemon leaves the cluster in an invalid state, and the only safe thing to do is remove the node from the cluster. A message is logged reporting which daemon appears to have died unexpectedly, then the node is forced to leave the cluster.
This behavior is enabled by default, but it can be turned on/off with the -s option to groupd (groupd -s [0|1], where 0 will disable this "shutdown" behavior and 1 will enable it).
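For reference, the option described above would be used like this (a usage sketch based on the description in this report, not verified against the shipped init scripts):

```
# default: shutdown behavior enabled
groupd -s 1

# disable the shutdown-on-daemon-death behavior
groupd -s 0
```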
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.