Description of problem:
If "fence_tool join" times out at startup, the node does not retry and does not rejoin the fence domain automatically when it becomes quorate later. The customer would like cman/fence_tool to do this without any manual intervention.

How reproducible:
100%. RHEL5U3 has this problem as well.

Steps to Reproduce:
This example uses a four-node GFS cluster.

1. Power on two of the nodes and wait until "fence_tool join" times out. In this step, node-1 and node-2 show output similar to the following:

   Starting cluster:
      Loading modules... done
      Mounting configfs... done
      Starting ccsd... done
      Starting cman... done
      Starting daemons... done
      Starting fencing... failed
                                                           [FAILED]   << "fence_tool join" failed due to timeout

2. After dellpc1 and dellpc2 have started, neither node has joined the fence domain.

   [root@dellpc1 ~]# clustat
   Cluster Status for new_clusterdell @ Tue May 19 10:18:03 2009
   Member Status: Inquorate   <<<<< This is a 4-node cluster and 2 votes are not enough for quorum

    Member Name        ID   Status
    ------ ----        ---- ------
    dellpc1               1 Online, Local
    dellpc2               2 Online
    dellpc3               3 Offline
    dellpc4               4 Offline

   [root@dellpc1 ~]# group_tool   <<<<<< Note: there is no "fence" group here because "fence_tool join" timed out.
   type             level name     id       state
   [root@dellpc1 ~]#

3. Make sure "fence_tool join" has timed out on node-1 and node-2, then power on node-3.

   Starting cluster:
      Loading modules... done
      Mounting configfs... done
      Starting ccsd... done
      Starting cman... done
      Starting daemons... done
      Starting fencing... done
                                                           [  OK  ]   <<< Fencing starts successfully in this step because the cluster is quorate at this time.

   After cman has started successfully on node-3, check the cluster status and fence domain on the three nodes.

   This is from node-3:

   [root@dellpc3 ~]# clustat
   Cluster Status for new_clusterdell @ Fri May 22 18:20:54 2009
   Member Status: Quorate   <<< Cluster is quorate at this time.

    Member Name        ID   Status
    ------ ----        ---- ------
    dellpc1               1 Online
    dellpc2               2 Online
    dellpc3               3 Online, Local
    dellpc4               4 Offline

   [root@dellpc3 ~]# group_tool   <<<< There is a fence domain here, and only node-3 is in it.
   type             level name     id       state
   fence            0     default  00010003 none
   [3]

   And this is from node-1 and node-2:

   [root@dellpc1 ~]# clustat
   Cluster Status for new_clusterdell @ Tue May 19 10:21:34 2009
   Member Status: Quorate

    Member Name        ID   Status
    ------ ----        ---- ------
    dellpc1               1 Online, Local
    dellpc2               2 Online
    dellpc3               3 Online   << node-3 is online at this time
    dellpc4               4 Offline

   [root@dellpc1 ~]#
   [root@dellpc1 ~]# group_tool   << but still no fence domain on node-1 and node-2.
   type             level name     id       state

   #### Notes: ####
   The customer would like node-1 and node-2 to rejoin the fence domain when node-3 comes online, just as they rejoin the cluster. node-4 will be fenced by node-3 and start up automatically.

4. GFS requires the node to be a member of a fence domain; otherwise it won't start. node-1 and node-2 are not members, so the GFS mount will **FAIL** at this point.

   [root@dellpc1 ~]# mount /dev/sdb1 /mnt/ -t gfs
   /sbin/mount.gfs: node not a member of the default fence domain
   /sbin/mount.gfs: error mounting lockproto lock_dlm

   node-3 is in a fence domain, so it can mount the GFS file system at this point.

   [root@dellpc3 ~]# mount /dev/sdb1 /mnt/
   [root@dellpc3 ~]#

5. On node-1 and node-2, "fence_tool join" has to be re-run by hand to rejoin the fence domain before GFS can be mounted.

   [root@dellpc1 ~]# fence_tool join
   [root@dellpc1 ~]# mount /dev/sdb1 /mnt/ -t gfs
   [root@dellpc1 ~]#

Actual results:
If "fence_tool join" times out while cman is starting, the node does not rejoin the fence domain automatically when it becomes quorate later.

Expected results:
The node should rejoin the fence domain automatically when it becomes quorate. In this example, cman on node-1 and node-2 should run "fence_tool join" automatically once they gain quorum in step 3.
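For illustration only, the manual recovery in step 5 could be scripted roughly as below. This is a sketch, not part of cman or the init script: the function names (is_quorate, in_fence_domain, rejoin_fence_domain) and the polling interval are hypothetical, and it assumes that clustat prints a "Member Status: Quorate" line and that group_tool lists a line starting with "fence" once the node is in the fence domain, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with cman): once the cluster is quorate,
# run "fence_tool join" if this node is not yet in the fence domain.

# Succeeds if the clustat output on stdin reports a quorate cluster.
# "Member Status: Inquorate" does not match this pattern.
is_quorate() {
    grep -q '^Member Status: Quorate'
}

# Succeeds if the group_tool output on stdin lists a fence group.
in_fence_domain() {
    grep -q '^fence[[:space:]]'
}

# Poll clustat every $1 seconds (default 10, an arbitrary choice) until the
# cluster is quorate, then join the fence domain unless already a member.
rejoin_fence_domain() {
    interval=${1:-10}
    until clustat | is_quorate; do
        sleep "$interval"
    done
    if ! group_tool | in_fence_domain; then
        fence_tool join
    fi
}
```

After sourcing the file, calling "rejoin_fence_domain 10" on node-1 or node-2 after a failed start would block until quorum is gained and then perform the join that step 5 does by hand.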
Added issue 300821 and adjusted priority/severity to match.
After talking with other developers, we don't have a good way to solve this with the initscript system we have in place. The safest thing we can do is set fenced's start timeout appropriately to require quorum:

   echo FENCED_START_TIMEOUT=0 >> /etc/sysconfig/cman
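Because ">>" appends a new line every time it is run, repeating the command leaves duplicate entries in the file. A minimal sketch of an idempotent variant follows; the function name set_fenced_start_timeout is hypothetical, and it assumes GNU sed (for "sed -i"), as found on RHEL.

```shell
#!/bin/sh
# Hypothetical helper: set FENCED_START_TIMEOUT=0 exactly once.
# $1 is the config file; it defaults to /etc/sysconfig/cman.
set_fenced_start_timeout() {
    conf=${1:-/etc/sysconfig/cman}
    if grep -q '^FENCED_START_TIMEOUT=' "$conf" 2>/dev/null; then
        # The setting already exists: rewrite its value in place.
        sed -i 's/^FENCED_START_TIMEOUT=.*/FENCED_START_TIMEOUT=0/' "$conf"
    else
        # The setting is missing: append it, as the original command does.
        echo 'FENCED_START_TIMEOUT=0' >> "$conf"
    fi
}
```

Running "set_fenced_start_timeout" as root with no argument edits /etc/sysconfig/cman; passing a path lets the change be tried on a copy first.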