Description of problem:
On a cluster configured with four nodes, three of the nodes are booted while the fourth node remains powered off. When fenced starts on each of the three nodes, it reports "lock_dlm: fence domain not found; check fenced". The problem only occurs on the initial boot of a cluster when not all of the configured nodes are started.

Version-Release number of selected component (if applicable):
dlm-1.0.0-5
fenced-1.32.18-0

How reproducible:
Always

Steps to Reproduce:
1. Configure a four-node cluster
2. Start three of the four nodes while the fourth is still powered off
3. Observe that fenced fails to start on each of the three nodes

Actual results:
fenced fails to start

Expected results:
fenced starts

Additional info:
Instead of using /etc/init.d/fenced to start fenced, we are using our own script, which calls:
exec /sbin/fence_tool join -w -D
The "lock_dlm: fence domain not found" message is reported when gfs is mounted. fenced on the three nodes should be fencing the fourth node, which may take some time (how long depends on the post_join_delay value in cluster.conf). Until fenced finishes starting on the three nodes, gfs can't be mounted on them, and fenced won't finish starting on the three until they have fenced the fourth node. We should be able to confirm this by looking in /var/log/messages and by running cman_tool services before and after the failed mount attempt.
In the past we have seen the situation you describe, where fencing could not complete because our fencing method depended on having access to shared storage. In those instances we saw messages that node 1 was attempting to fence node 4 but that fencing had failed. fenced would then keep attempting to fence node 4, and keep failing, until the problem was resolved.

This case seems to be a little different. We do not see any messages that fenced is trying to fence the fourth node. Here is a snippet from /var/log/messages on node 1. Nodes 2 and 3 show the same messages, with no indication that node 4 is being fenced.

Jun 28 10:57:01 sqazero01 kernel: CMAN <CVS> (built Jun 10 2006 03:13:36) installed
Jun 28 10:57:01 sqazero01 kernel: NET: Registered protocol family 30
Jun 28 10:57:02 sqazero01 kernel: CMAN: Waiting to join or form a Linux-cluster
Jun 28 10:57:18 sqazero01 fstab-sync[6351]: added mount point /media/cdrom for /dev/hdc
Jun 28 10:57:31 sqazero01 hald[24459]: Timed out waiting for hotplug event 2601. Rebasing to 2464
Jun 28 10:57:34 sqazero01 kernel: CMAN: forming a new cluster
Jun 28 10:58:30 sqazero01 kernel: CMAN: quorum regained, resuming activity
Jun 28 10:58:30 sqazero01 kernel: DLM <CVS> (built Jun 10 2006 03:13:46) installed
Jun 28 10:58:30 sqazero01 kernel: DLM Opaque Thread started
Jun 28 10:58:30 sqazero01 cman: startup succeeded
Jun 28 10:58:31 sqazero01 hald[24459]: Timed out waiting for hotplug event 2465. Rebasing to 2601
Jun 28 10:58:36 sqazero01 clvmd: Cluster LVM daemon started - connected to CMAN
Jun 28 10:58:49 sqazero01 ntpd[24318]: synchronized to 10.250.0.9, stratum 2
Jun 28 10:58:49 sqazero01 ntpd[24318]: kernel time sync disabled 0041
Jun 28 10:58:50 sqazero01 kernel: Lock_Harness <CVS> (built Jun 10 2006 03:04:32) installed
Jun 28 10:58:50 sqazero01 kernel: GFS <CVS> (built Jun 10 2006 03:04:18) installed
Jun 28 10:58:50 sqazero01 kernel: GFS: Trying to join cluster "lock_dlm", "sqazero:crosswalk"
Jun 28 10:58:50 sqazero01 kernel: Lock_DLM (built Jun 10 2006 03:04:25) installed
Jun 28 10:58:50 sqazero01 kernel: lock_dlm: fence domain not found; check fenced
Jun 28 10:58:50 sqazero01 kernel: GFS: can't mount proto = lock_dlm, table = sqazero:crosswalk, hostdata =
Jun 28 10:58:56 sqazero01 kernel: GFS: Trying to join cluster "lock_dlm", "sqazero:crosswalk"
Jun 28 10:58:56 sqazero01 kernel: lock_dlm: fence domain not found; check fenced
Jun 28 10:58:56 sqazero01 kernel: GFS: can't mount proto = lock_dlm, table = sqazero:crosswalk, hostdata =
- Do you have post_join_delay set in cluster.conf? If so, what is the value?
- Could you collect the output of cman_tool services before and after the gfs mount that fails? Preferably from all the nodes in the cluster.
- The output of fenced -D may also be useful to see.
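For reference, a minimal way to capture this on each node might look like the following sketch. It requires a live cluster, and the mount device, mount point, and output file names here are placeholders, not values from this report:

```
# Record fence/DLM service state before the mount attempt
cman_tool services > /tmp/services.before.txt

# Attempt the gfs mount (device and mount point are placeholders)
mount -t gfs /dev/vg0/crosswalk /mnt/crosswalk

# Record service state again after the (possibly failed) mount
cman_tool services > /tmp/services.after.txt
```

Comparing the two files should show whether the fence domain ("fence" service group) had finished forming at the time of the mount.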
post_join_delay is set to 180. Where can I find the output of fenced -D? I'll collect the cman_tool services output when I have access to the system again.
A post_join_delay of 180 means that fenced will delay three minutes before fencing the fourth, non-member node at startup. Any gfs mount attempts during this time will fail in the way you're seeing. If you don't want the gfs mount failures, you'll need to wait for fenced to finish starting before attempting the mounts. I'm guessing that this fully explains what you're seeing. You'd have to redirect the output of fenced -D to a file if you want to save it.
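For anyone hitting the same thing, the delay is configured via the fence_daemon element in cluster.conf. A fragment like the one below is a sketch, not this reporter's actual configuration; the cluster name is taken from the logs above, and config_version and post_fail_delay are assumed values:

```xml
<cluster name="sqazero" config_version="1">
  <!-- post_join_delay: seconds fenced waits after joining the fence
       domain before fencing nodes that have not yet joined. Lowering
       it shortens the window during which gfs mounts fail at startup,
       at the cost of fencing slow-booting nodes sooner. -->
  <fence_daemon post_join_delay="180" post_fail_delay="0"/>
</cluster>
```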
Any update on this one?
Haven't seen this problem in a long time. Feel free to close as invalid.