Bug 460190

Summary: new option to delay fence_tool join
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: urgent
Reporter: David Teigland <teigland>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: bkahn, ccaulfie, cluster-maint, djansa, edamato, jplans, lhh, tao, zxvdr.au
Target Milestone: rc
Keywords: ZStream
Doc Type: Bug Fix
Last Closed: 2009-01-20 21:50:23 UTC
Bug Blocks: 443358, 471272
Attachments:
fence_tool patch (flags: none)
patch for cman init script (flags: none)

Description David Teigland 2008-08-26 17:37:21 UTC
Description of problem:

Certain network/switch settings cause nodes to form partitioned clusters
when they start up.  We want to provide information to help people configure
their switches to prevent this (see Documentation note).

We can also add code to better cope with these network problems, since
they seem to be somewhat common.  The network partitions are a particular
problem for two_node clusters where a node has quorum when it starts up
on its own.  There are two parts to this work-around:

1. Add a new fence_tool option, -m, e.g. fence_tool join -m 45.
This will cause fence_tool to wait until all nodes in cluster.conf
are cluster members, or until the timeout (45 seconds in this example)
expires, whichever comes first, before joining the fence domain.

The idea is to use this option to let openais on all the nodes
see each other before the fence domain starts, so that we join the
domain *after* the nodes merge into a single cluster.  If we joined the
domain *before* the cluster partitions merged, nodes would end up being
fenced unnecessarily.  (This is similar in idea to post_join_delay: a delay
that gives us time to determine that a node in an unknown state is actually
ok and doesn't require fencing.)
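
The "wait for all members or the timeout, whichever comes first" behaviour
can be sketched in shell as below. This is only an illustration of the idea,
not the code in the attached fence_tool patch. The function name
wait_for_members and the caller-supplied member-count command are
hypothetical names introduced for this sketch.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual fence_tool implementation.
# Wait until the expected number of nodes are cluster members, or until
# the timeout expires, whichever comes first.
#
#   $1 = timeout in seconds
#   $2 = number of nodes expected (e.g. the nodes listed in cluster.conf)
#   $3 = command that prints the current member count (assumed helper,
#        supplied by the caller)
wait_for_members() {
    timeout=$1
    expected=$2
    count_cmd=$3
    waited=0
    while :; do
        current=$($count_cmd)
        current=${current:-0}
        # All nodes are members: stop waiting, skip the remaining delay.
        if [ "$current" -ge "$expected" ]; then
            return 0
        fi
        # Timeout reached: give up waiting; the caller joins anyway.
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

A caller would run the wait and then join the fence domain whatever the
result, since the timeout only bounds the wait and does not cancel the join.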

2. Use the new fence_tool -m option in the cman init script.  Again, this
is primarily a problem with two_node clusters (in other clusters, waiting
for quorum usually masks the partitioning problems).  So, we want the
init script to check whether the cluster is two_node, and use -m if it is.
(It could do this with 'grep two_node /etc/cluster/cluster.conf' or
'cman_tool status | grep Flags | grep 2node'.)  It initially appears that
we'll want a default -m value of about 45 seconds.  Again, if the nodes
converge normally during startup, the wait ends as soon as they are all members.
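
A minimal sketch of the init script check follows, using the cluster.conf
grep approach mentioned above. It is not the attached init script patch. The
function name fence_join_opts is hypothetical, and the sketch assumes that a
two_node cluster is marked by two_node="1" in cluster.conf.

```shell
#!/bin/sh
# Hypothetical sketch of the init script check, not the actual patch.
# Prints the extra arguments to pass to "fence_tool join".
#
#   $1 = path to cluster.conf (default: /etc/cluster/cluster.conf)
#   $2 = delay in seconds (default: 45, the value suggested in this report)
fence_join_opts() {
    conf=${1:-/etc/cluster/cluster.conf}
    delay=${2:-45}
    # Assumption: two_node clusters carry two_node="1" in cluster.conf.
    if grep -q 'two_node="1"' "$conf" 2>/dev/null; then
        echo "-m $delay"
    fi
    # For any other cluster nothing is printed, so no delay is requested.
}

# Intended use inside the init script (not run here):
#   fence_tool join $(fence_join_opts)
```

Matching on two_node="1" rather than the bare string two_node avoids
enabling the delay when the attribute is present but set to 0.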


Comment 1 RHEL Program Management 2008-08-26 18:02:40 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 David Teigland 2008-08-26 21:05:20 UTC
Created attachment 315052 [details]
fence_tool patch

patch for the fence_tool part of the solution

Comment 3 David Teigland 2008-08-26 21:40:56 UTC
Created attachment 315053 [details]
patch for cman init script

Patch to init.d/cman to use the new fence_tool -m option.

Comment 4 David Teigland 2008-08-27 16:42:56 UTC
pushed to RHEL5 and STABLE2 branches

RHEL5 5ea416d26ec2b6bf605c573a5173736d0f8cd27c 397b8111d2d69b9dd25e7b074822be571f274032

STABLE2 7087a7d5e8c9601a9f405ee71befa3db90256481 41a69f04aeaf9aa3f38c899bf55495f04c19831c

Comment 12 errata-xmlrpc 2009-01-20 21:50:23 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html