Bug 503306 - Fence_tool should keep retrying and join back to fence_domain automatically when the node becomes quorate
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Marek Grac
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-05-31 07:38 UTC by Zhenyong(Jerry) Jiang
Modified: 2018-10-20 03:56 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-23 15:52:59 UTC
Target Upstream Version:
Embargoed:



Description Zhenyong(Jerry) Jiang 2009-05-31 07:38:20 UTC
Description of problem:

If "fence_tool join" times out at startup, it does not keep retrying, and the node does not rejoin the fence domain automatically when it becomes quorate later. The customer would like cman/fence_tool to do this without any manual intervention.

How reproducible:

100%. RHEL5U3 is affected as well.

The example below uses a four-node GFS cluster.

Steps to reproduce the issue:

1, Power on two of the nodes and wait until "fence_tool join" times out. At this step, node-1 and node-2 show output similar to the following.

Starting cluster:
 Loading modules... done
 Mounting configfs... done
 Starting ccsd... done
 Starting cman... done
 Starting daemons... done
 Starting fencing... failed
                                                         [FAILED]   << "fence_tool join" failed due to timeout

2, After nodes dellpc1 and dellpc2 have started up, we can see that they did not join the fence domain.

[root@dellpc1 ~]# clustat
Cluster Status for new_clusterdell @ Tue May 19 10:18:03 2009
Member Status: Inquorate              <<<<< This is a 4-node cluster, and 2 votes are not enough for quorum
Member Name                            ID   Status
------ ----                            ---- ------
dellpc1                                    1 Online, Local
dellpc2                                    2 Online
dellpc3                                    3 Offline
dellpc4                                    4 Offline
[root@dellpc1 ~]# group_tool             <<<<<< Note: there is no group "fence" here because "fence_tool join" timed out.
type             level name  id       state      
[root@dellpc1 ~]#


3, Make sure "fence_tool join" has timed out on node-1 and node-2, then power on node-3.
  
Starting cluster:
 Loading modules... done
 Mounting configfs... done
 Starting ccsd... done
 Starting cman... done
 Starting daemons... done
 Starting fencing... done
                                                         [  OK  ]
<<< Fencing starts successfully at this step because the cluster is now quorate.

After cman has started successfully on node-3, check the cluster status and the fence domain on the three nodes.

This is from node-3
[root@dellpc3 ~]# clustat
Cluster Status for new_clusterdell @ Fri May 22 18:20:54 2009
Member Status: Quorate                     <<<  Cluster is quorate at this time.
Member Name                            ID   Status
------ ----                            ---- ------
dellpc1                                    1 Online
dellpc2                                    2 Online
dellpc3                                    3 Online, Local
dellpc4                                    4 Offline
[root@dellpc3 ~]# group_tool             <<<<  there is a fence domain here, and only node-3 is in it.
type             level name     id       state      
fence            0     default  00010003 none        
[3]

And this is from node-1 and node-2:

[root@dellpc1 ~]# clustat
Cluster Status for new_clusterdell @ Tue May 19 10:21:34 2009
Member Status: Quorate
Member Name                            ID   Status
------ ----                            ---- ------
dellpc1                                    1 Online, Local
dellpc2                                    2 Online
dellpc3                                    3 Online   <<  node-3 is online at this time
dellpc4                                    4 Offline
[root@dellpc1 ~]#
[root@dellpc1 ~]# group_tool                  << but still no fence domain on node-1 and node-2.
type             level name  id       state   

####Notes: #### 
The customer would like node-1 and node-2 to rejoin the fence domain when node-3 comes online, just as they rejoin the cluster.

As for node-4, it will be fenced by node-3 and will start up automatically.


4, GFS requires the node to be a member of the fence domain; otherwise the file system cannot be mounted.

node-1 and node-2 are not members, so the GFS mount will **FAIL** at this point.
[root@dellpc1 ~]# mount /dev/sdb1 /mnt/ -t gfs
/sbin/mount.gfs: node not a member of the default fence domain
/sbin/mount.gfs: error mounting lockproto lock_dlm

But node-3 is in the fence domain, so it can mount the GFS file system at this point.
[root@dellpc3 ~]# mount /dev/sdb1 /mnt/
[root@dellpc3 ~]#


5, On node-1 and node-2, "fence_tool join" has to be re-run by hand so that they join the fence domain and can mount the GFS file system.

[root@dellpc1 ~]# fence_tool join
[root@dellpc1 ~]# mount /dev/sdb1 /mnt/ -t gfs
[root@dellpc1 ~]#
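
The manual workaround in step 5 could be wrapped in a small helper. This is only a sketch, not something shipped with cman: the function names and the GROUP_TOOL/FENCE_TOOL override variables are made up for illustration, and it assumes that a row of type "fence" in the group_tool output means the node is in the fence domain, as in the node-3 listing above.

```shell
#!/bin/bash
# Hypothetical helper: join the fence domain only if this node is not in it.
# Assumption: a "fence" row in group_tool output means membership.

GROUP_TOOL=${GROUP_TOOL:-group_tool}
FENCE_TOOL=${FENCE_TOOL:-fence_tool}

# Succeeds if group_tool lists a group whose type column is "fence".
in_fence_domain() {
    "$GROUP_TOOL" | awk '$1 == "fence" { found = 1 } END { exit !found }'
}

# Run "fence_tool join" unless the node is already a member.
ensure_fence_domain() {
    if in_fence_domain; then
        return 0
    fi
    "$FENCE_TOOL" join
}

# Usage (device and mount point as in this report):
#   ensure_fence_domain && mount /dev/sdb1 /mnt/ -t gfs
```

Checking first avoids issuing a second join on a node that is already a member.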


Actual results:

If "fence_tool join" times out while cman is starting, the node does not rejoin the fence domain automatically when it becomes quorate later.

Expected results:

A node should join the fence domain automatically when it becomes quorate. In this example, I would like cman on node-1 and node-2 to execute "fence_tool join" automatically when they gain quorum in step 3.
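
To illustrate the requested behavior, here is a rough sketch of a watcher that polls for quorum and then joins. It is not part of cman or its init script: the function names, the CLUSTAT/FENCE_TOOL override variables and the 10-second poll interval are all assumptions, and quorum is detected simply by looking for the "Member Status: Quorate" line that clustat prints in the transcripts above.

```shell
#!/bin/bash
# Hypothetical watcher, not part of cman: poll until quorate, then join.

CLUSTAT=${CLUSTAT:-clustat}
FENCE_TOOL=${FENCE_TOOL:-fence_tool}
POLL_INTERVAL=${POLL_INTERVAL:-10}   # seconds between checks (arbitrary choice)

# Succeeds if clustat reports "Member Status: Quorate".
is_quorate() {
    "$CLUSTAT" 2>/dev/null | grep -q '^Member Status: Quorate'
}

# Block until the cluster is quorate, then run "fence_tool join" once.
join_when_quorate() {
    until is_quorate; do
        sleep "$POLL_INTERVAL"
    done
    "$FENCE_TOOL" join
}

# Usage: run in the background after the init script has given up, e.g.
#   join_when_quorate &
```

A real fix would also have to decide what to do if the join itself fails or the node loses quorum again, which this sketch ignores.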

Comment 2 Lachlan McIlroy 2009-06-03 07:31:09 UTC
Added issue 300821 and adjusted priority/severity to match.

Comment 4 Lon Hohberger 2009-09-23 15:52:59 UTC
After talking with other developers, we don't have a good way to solve this with the initscript system we have in place.

The safest thing we can do is set fenced's start timeout appropriately to require quorum:

echo FENCED_START_TIMEOUT=0 >> /etc/sysconfig/cman
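
Since the echo above appends a new line every time it is run, a slightly more careful way to apply the same setting might look like the following. This is just a sketch: set_sysconfig is a made-up helper name, and it assumes GNU sed and a file made of plain KEY=VALUE lines.

```shell
#!/bin/bash
# Hypothetical helper: set KEY=VALUE in a sysconfig-style file without
# adding duplicate lines. Assumes GNU sed (for -i).
set_sysconfig() {
    local file=$1 key=$2 value=$3
    touch "$file"
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite it in place.
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Key missing: append it.
        echo "${key}=${value}" >> "$file"
    fi
}

# Usage:
#   set_sysconfig /etc/sysconfig/cman FENCED_START_TIMEOUT 0
```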

