Bug 460190 - new option to delay fence_tool join
Summary: new option to delay fence_tool join
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.2
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 443358 471272
 
Reported: 2008-08-26 17:37 UTC by David Teigland
Modified: 2018-10-20 01:49 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 21:50:23 UTC
Target Upstream Version:
Embargoed:


Attachments
fence_tool patch (5.52 KB, text/plain)
2008-08-26 21:05 UTC, David Teigland
patch for cman init script (1.92 KB, patch)
2008-08-26 21:40 UTC, David Teigland


Links
System: Red Hat Product Errata
ID: RHBA-2009:0189
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: cman bug-fix and enhancement update
Last Updated: 2009-01-20 16:05:55 UTC

Description David Teigland 2008-08-26 17:37:21 UTC
Description of problem:

Certain network/switch settings cause nodes to form partitioned clusters
when they start up.  We want to provide information to help people configure
their switches to prevent this (see Documentation note).

We can also add code to better cope with these network problems, since
they seem to be somewhat common.  The network partitions are a particular
problem for two_node clusters where a node has quorum when it starts up
on its own.  There are two parts to this work-around:

1. Add new fence_tool option -m, e.g. fence_tool join -m 45.
This will cause fence_tool to wait for all nodes in cluster.conf
to be cluster members, or the timeout (45 seconds), whichever comes
first, before joining the fence domain.

The idea is that we'd use this option to allow openais on the nodes
to all see each other before starting the fence domain. So we join the
domain *after* the nodes merge into a single cluster.  If we joined the
domain *before* the cluster partitions merged, nodes would end up being
fenced unnecessarily.  (This is a similar idea to post_join_delay: a delay
that gives us time to determine that a node in an unknown state is actually
ok and doesn't require fencing.)

2. Use the new fence_tool -m option in the cman init script.  Again, this
is primarily a problem with two_node clusters (because waiting for quorum
usually masks the partitioning problems otherwise).  So, we want the
init script to check if the cluster is two_node, and use -m if it is.
(It could do this with 'grep two_node /etc/cluster/cluster.conf' or
'cman_tool status | grep Flags | grep 2node'.)  It initially appears that
we'll want a default -m value of about 45 seconds.  Again, if the nodes
converge normally during startup, this delay will be skipped.
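
The init-script check described in part 2 could be sketched roughly as
follows.  This is an illustration only, not the attached patch: the helper
names is_two_node and fence_join_args and the FENCE_MEMBER_WAIT variable are
made up for the example, and matching on two_node="1" is an assumption about
how the flag appears in cluster.conf.  Only 'fence_tool join -m' and the
45-second default come from the description above.

```shell
#!/bin/sh
# Sketch only: names below are hypothetical and not taken from the real
# init.d/cman script or the attached patch.

# Default wait (seconds) for all cluster.conf nodes to become members.
FENCE_MEMBER_WAIT=45

# Succeeds if the given cluster.conf enables two_node mode.
# Assumption: the flag is written as two_node="1" in the file.
is_two_node() {
    grep -q 'two_node="1"' "$1" 2>/dev/null
}

# Prints the extra arguments for "fence_tool join":
# "-m <seconds>" for two_node clusters, nothing otherwise.
fence_join_args() {
    if is_two_node "$1"; then
        echo "-m $FENCE_MEMBER_WAIT"
    fi
}
```

A start function would then run something like
'fence_tool join $(fence_join_args /etc/cluster/cluster.conf)', so clusters
without two_node would join the fence domain exactly as before.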



Comment 1 RHEL Program Management 2008-08-26 18:02:40 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 David Teigland 2008-08-26 21:05:20 UTC
Created attachment 315052
fence_tool patch

patch for the fence_tool part of the solution

Comment 3 David Teigland 2008-08-26 21:40:56 UTC
Created attachment 315053
patch for cman init script

Patch to init.d/cman to use the new fence_tool -m option.

Comment 4 David Teigland 2008-08-27 16:42:56 UTC
pushed to RHEL5 and STABLE2 branches

RHEL5 5ea416d26ec2b6bf605c573a5173736d0f8cd27c 397b8111d2d69b9dd25e7b074822be571f274032

STABLE2 7087a7d5e8c9601a9f405ee71befa3db90256481 41a69f04aeaf9aa3f38c899bf55495f04c19831c

Comment 12 errata-xmlrpc 2009-01-20 21:50:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html

