Bug 197154
Summary: | Fenced fails to find fence domain on initial startup if all configured nodes are not started | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Henry Harris <henry.harris>
Component: | dlm | Assignee: | David Teigland <teigland>
Status: | CLOSED NOTABUG | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4 | CC: | ccaulfie, cluster-maint
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-10-17 16:28:01 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Henry Harris 2006-06-28 21:38:52 UTC

The "lock_dlm: fence domain not found" message is reported when gfs is mounted.

fenced on the 3 nodes should be fencing the fourth node, which may take some time (how long depends on the post_join_delay value in cluster.conf). Until fenced completes starting on the 3 nodes, gfs can't be mounted on those nodes, and fenced won't complete starting on the 3 until the fourth node has been fenced by them. We should be able to confirm this by looking in /var/log/messages and by running cman_tool services before and after the failed mount attempt.

In the past we have seen the situation that you describe, where fencing could not be completed because our fencing method depended on having access to shared storage. In those instances we saw messages that node 1 was attempting to fence node 4 but that fencing failed; fenced would then keep attempting to fence node 4, and keep failing, until the problem was resolved. This case seems to be a little different: we do not see any messages that fenced is trying to fence the fourth node. Here is a snippet from /var/log/messages on node 1. Nodes 2 and 3 show the same messages, with no indication that node 4 is being fenced.

```
Jun 28 10:57:01 sqazero01 kernel: CMAN <CVS> (built Jun 10 2006 03:13:36) installed
Jun 28 10:57:01 sqazero01 kernel: NET: Registered protocol family 30
Jun 28 10:57:02 sqazero01 kernel: CMAN: Waiting to join or form a Linux-cluster
Jun 28 10:57:18 sqazero01 fstab-sync[6351]: added mount point /media/cdrom for /dev/hdc
Jun 28 10:57:31 sqazero01 hald[24459]: Timed out waiting for hotplug event 2601. Rebasing to 2464
Jun 28 10:57:34 sqazero01 kernel: CMAN: forming a new cluster
Jun 28 10:58:30 sqazero01 kernel: CMAN: quorum regained, resuming activity
Jun 28 10:58:30 sqazero01 kernel: DLM <CVS> (built Jun 10 2006 03:13:46) installed
Jun 28 10:58:30 sqazero01 kernel: DLM Opaque Thread started
Jun 28 10:58:30 sqazero01 cman: startup succeeded
Jun 28 10:58:31 sqazero01 hald[24459]: Timed out waiting for hotplug event 2465. Rebasing to 2601
Jun 28 10:58:36 sqazero01 clvmd: Cluster LVM daemon started - connected to CMAN
Jun 28 10:58:49 sqazero01 ntpd[24318]: synchronized to 10.250.0.9, stratum 2
Jun 28 10:58:49 sqazero01 ntpd[24318]: kernel time sync disabled 0041
Jun 28 10:58:50 sqazero01 kernel: Lock_Harness <CVS> (built Jun 10 2006 03:04:32) installed
Jun 28 10:58:50 sqazero01 kernel: GFS <CVS> (built Jun 10 2006 03:04:18) installed
Jun 28 10:58:50 sqazero01 kernel: GFS: Trying to join cluster "lock_dlm", "sqazero:crosswalk"
Jun 28 10:58:50 sqazero01 kernel: Lock_DLM (built Jun 10 2006 03:04:25) installed
Jun 28 10:58:50 sqazero01 kernel: lock_dlm: fence domain not found; check fenced
Jun 28 10:58:50 sqazero01 kernel: GFS: can't mount proto = lock_dlm, table = sqazero:crosswalk, hostdata =
Jun 28 10:58:56 sqazero01 kernel: GFS: Trying to join cluster "lock_dlm", "sqazero:crosswalk"
Jun 28 10:58:56 sqazero01 kernel: lock_dlm: fence domain not found; check fenced
Jun 28 10:58:56 sqazero01 kernel: GFS: can't mount proto = lock_dlm, table = sqazero:crosswalk, hostdata =
```

- Do you have post_join_delay set in cluster.conf? If so, what is that value?
- Could you collect the output of cman_tool services before and after the gfs mount that fails? Preferably from all the nodes in the cluster.
- The output of fenced -D may also be useful to see.

post_join_delay is set to 180. Where can I find the output of fenced -D? I'll collect the cman_tool services output when I have access to the system again.
A post_join_delay of 180 means that fenced will delay 3 minutes before fencing the 4th non-member node at startup time. Any gfs mount attempts in this window will fail the way you're seeing. If you don't want the gfs mount failures, you'll need to wait for fenced to finish starting before attempting the mounts. I'm guessing that this fully explains what you're seeing. You'd have to redirect the output of fenced -D to a file if you want to save it.

Any update on this one?

Haven't seen this problem in a long time. Feel free to close as invalid.