Description of problem:

When another groupd process fails, e.g. killall -9 groupd, the "reason" in the cpg confchg is LEAVE instead of PROCDOWN (openais bz 599654). This means groupd processes the removal of the node as a clean shutdown instead of recovering it as a failure.

Work around this by detecting when a node is removed from the groupd cpg but is still a member of other cpgs for fenced/dlm/gfs, and treating the "leave" as a "procdown" in that case.

This workaround probably obviates the workaround for bz 521817, but the patch leaves that workaround in place.
Created attachment 419564
patch

groupd: recover for groupd failure

bz 599747

When another groupd process fails, e.g. killall -9 groupd, the "reason" in the cpg confchg is LEAVE, instead of PROCDOWN. This means groupd processes the removal of the node as a clean shutdown instead of recovering it as a failure.

Work around this by detecting when a node is removed from the groupd cpg but is still a member of other cpg's for fenced/dlm/gfs, and treat the "leave" as a "procdown" in that case.

This workaround probably obviates the workaround for bz 521817, "groupd: clean up leaving failed node" but leaves it in place.
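The check described in the patch can be sketched roughly as follows. This is only an illustration of the idea, not the code in the attachment: the enum values, struct group_view, and effective_reason() are hypothetical names made up for this sketch, and the real groupd data structures and cpg reason constants differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the cpg confchg reasons; not the real openais definitions. */
enum confchg_reason { REASON_LEAVE, REASON_PROCDOWN };

/* Hypothetical view of one non-groupd cpg (fenced, dlm, gfs) and its members. */
struct group_view {
    const char *name;
    const int *members;
    size_t member_count;
};

/* Return true if nodeid appears in the group's member list. */
static bool is_member(const struct group_view *g, int nodeid)
{
    for (size_t i = 0; i < g->member_count; i++)
        if (g->members[i] == nodeid)
            return true;
    return false;
}

/*
 * Decide how to treat a node's removal from the groupd cpg.
 * A LEAVE from a node that still belongs to any other group cannot be a
 * clean shutdown, so it is upgraded to PROCDOWN and handled as a failure.
 */
enum confchg_reason effective_reason(enum confchg_reason reported, int nodeid,
                                     const struct group_view *groups,
                                     size_t group_count)
{
    if (reported != REASON_LEAVE)
        return reported;
    for (size_t i = 0; i < group_count; i++)
        if (is_member(&groups[i], nodeid))
            return REASON_PROCDOWN;
    return REASON_LEAVE;
}
```

A node that has cleanly left every fence/dlm/gfs group before leaving the groupd cpg still yields REASON_LEAVE here, so normal shutdown is unaffected under this sketch.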
The test I'm using is simply:

on nodes 1-4:
  service cman start
  service clvmd start
  mount gfs

on node1:
  killall -9 groupd

Without the patch in comment 1, all daemons on node1 exit except for aisexec. Nodes 2-4 show:

[root@z2 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     clvmd    00010004 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
dlm              1     vedder0  00020001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]
gfs              2     vedder0  00010001 LEAVE_STOP_WAIT 1 100030002 1
[1 2 3 4]

If you next killall -9 aisexec on node1, then nodes 2-4 just remove node1 without any recovery, due to the combination of two things:

1. openais reporting the groupd failure as a LEAVE rather than PROCDOWN, as reported in bug 599654
2. the groupd patch in bug 521817

The simple removal of node1 without recovery is a dangerous result that could lead to gfs corruption.

Using the patch in comment 1: when nodes 2-4 see groupd killed on node1, they will kill aisexec on node1 and recover everything properly.
This patch should be expedited for any releases where bug 521817 has been fixed. The fix for bug 521817 on its own is dangerous because it opens the possibility of a failed node not being recovered.
With bug 599654 fixed, this workaround is not needed.