Description of problem:
Certain network/switch settings cause nodes to form partitioned clusters when they start up. We want to provide information to help people configure their switches to prevent this (see Documentation note). We can also add code to better cope with these network problems, since they seem to be somewhat common. The network partitions are a particular problem for two_node clusters where a node has quorum when it starts up on its own.

There are two parts to this work-around:

1. Add new fence_tool option -m, e.g. fence_tool join -m 45. This will cause fence_tool to wait for all nodes in cluster.conf to be cluster members, or the timeout (45 seconds), whichever comes first, before joining the fence domain. The idea is that we'd use this option to allow openais on the nodes to all see each other before starting the fence domain. So we join the domain *after* the nodes merge into a single cluster. If we joined the domain *before* the cluster partition merged, then nodes end up being fenced unnecessarily. (This is a similar idea to post_join_delay; a delay that gives us time to determine that a node in an unknown state is actually ok and doesn't require fencing.)

2. Use the new fence_tool -m option in the cman init script. Again, this is primarily a problem with two_node clusters (because waiting for quorum usually masks the partitioning problems otherwise). So, we want the init script to check if the cluster is two_node, and use -m if it is. (It could do this by 'grep two_node /etc/cluster/cluster.conf', or 'cman_tool status | grep Flags | grep 2node'.)

It initially appears that we'll want a default -m value of about 45 seconds. Again, if the nodes converge normally during startup, this delay will be skipped.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
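The init script check described in part 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: the helper name fence_join_args and the variable FENCE_MEMBER_WAIT are hypothetical, and matching on two_node="1" (rather than a bare 'grep two_node') is an assumption about how the attribute is written in cluster.conf.

```shell
#!/bin/sh
# Sketch only: decide which arguments to pass to fence_tool at startup.
# FENCE_MEMBER_WAIT and fence_join_args are hypothetical names, not taken
# from the actual cman init script.
FENCE_MEMBER_WAIT=${FENCE_MEMBER_WAIT:-45}

fence_join_args() {
    # $1: path to cluster.conf (defaults to the standard location)
    conf=${1:-/etc/cluster/cluster.conf}

    # Assumption: a two_node cluster is marked by two_node="1" in the file.
    if grep -q 'two_node="1"' "$conf" 2>/dev/null; then
        # Two-node cluster: wait for all members (or the timeout) first.
        echo "join -m $FENCE_MEMBER_WAIT"
    else
        # Otherwise waiting for quorum usually masks the partition problem.
        echo "join"
    fi
}
```

An init script could then run something like fence_tool $(fence_join_args), so that only two_node clusters pay the (at most 45 second) wait, and only when the nodes have not already converged.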
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 315052: fence_tool patch
Patch for the fence_tool part of the solution.
Created attachment 315053: patch for cman init script
Patch to init.d/cman to use the new fence_tool -m option.
Pushed to RHEL5 and STABLE2 branches.

RHEL5
5ea416d26ec2b6bf605c573a5173736d0f8cd27c
397b8111d2d69b9dd25e7b074822be571f274032

STABLE2
7087a7d5e8c9601a9f405ee71befa3db90256481
41a69f04aeaf9aa3f38c899bf55495f04c19831c
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2009-0189.html