Bug 197154
Summary: | Fenced fails to find fence domain on initial startup if all configured nodes are not started | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Henry Harris <henry.harris>
Component: | dlm | Assignee: | David Teigland <teigland>
Status: | CLOSED NOTABUG | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4 | CC: | ccaulfie, cluster-maint
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-10-17 16:28:01 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Henry Harris 2006-06-28 21:38:52 UTC

The "lock_dlm: fence domain not found" message is reported when gfs is mounted.

fenced on the 3 nodes should be fencing the fourth node, which may take some time (how long depends on the post_join_delay value in cluster.conf). Until fenced completes starting on the 3 nodes, gfs can't be mounted on those nodes, and fenced won't complete starting on the 3 until the fourth node has been fenced by them. We should be able to confirm this by looking in /var/log/messages and by running cman_tool services before and after the failed mount attempt.

In the past we have seen the situation that you describe, where fencing could not be completed because our fencing method depended on having access to shared storage. In those instances we saw messages that node 1 was attempting to fence node 4 but that fencing failed; fenced would then keep attempting to fence node 4, and keep failing, until the problem was resolved. This case seems to be a little different: we do not see any messages that fenced is trying to fence the fourth node. Here is a snippet from /var/log/messages on node 1. Nodes 2 and 3 show the same messages, with no indication that node 4 is being fenced.

```
Jun 28 10:57:01 sqazero01 kernel: CMAN <CVS> (built Jun 10 2006 03:13:36) installed
Jun 28 10:57:01 sqazero01 kernel: NET: Registered protocol family 30
Jun 28 10:57:02 sqazero01 kernel: CMAN: Waiting to join or form a Linux-cluster
Jun 28 10:57:18 sqazero01 fstab-sync[6351]: added mount point /media/cdrom for /dev/hdc
Jun 28 10:57:31 sqazero01 hald[24459]: Timed out waiting for hotplug event 2601. Rebasing to 2464
Jun 28 10:57:34 sqazero01 kernel: CMAN: forming a new cluster
Jun 28 10:58:30 sqazero01 kernel: CMAN: quorum regained, resuming activity
Jun 28 10:58:30 sqazero01 kernel: DLM <CVS> (built Jun 10 2006 03:13:46) installed
Jun 28 10:58:30 sqazero01 kernel: DLM Opaque Thread started
Jun 28 10:58:30 sqazero01 cman: startup succeeded
Jun 28 10:58:31 sqazero01 hald[24459]: Timed out waiting for hotplug event 2465. Rebasing to 2601
Jun 28 10:58:36 sqazero01 clvmd: Cluster LVM daemon started - connected to CMAN
Jun 28 10:58:49 sqazero01 ntpd[24318]: synchronized to 10.250.0.9, stratum 2
Jun 28 10:58:49 sqazero01 ntpd[24318]: kernel time sync disabled 0041
Jun 28 10:58:50 sqazero01 kernel: Lock_Harness <CVS> (built Jun 10 2006 03:04:32) installed
Jun 28 10:58:50 sqazero01 kernel: GFS <CVS> (built Jun 10 2006 03:04:18) installed
Jun 28 10:58:50 sqazero01 kernel: GFS: Trying to join cluster "lock_dlm", "sqazero:crosswalk"
Jun 28 10:58:50 sqazero01 kernel: Lock_DLM (built Jun 10 2006 03:04:25) installed
Jun 28 10:58:50 sqazero01 kernel: lock_dlm: fence domain not found; check fenced
Jun 28 10:58:50 sqazero01 kernel: GFS: can't mount proto = lock_dlm, table = sqazero:crosswalk, hostdata =
Jun 28 10:58:56 sqazero01 kernel: GFS: Trying to join cluster "lock_dlm", "sqazero:crosswalk"
Jun 28 10:58:56 sqazero01 kernel: lock_dlm: fence domain not found; check fenced
Jun 28 10:58:56 sqazero01 kernel: GFS: can't mount proto = lock_dlm, table = sqazero:crosswalk, hostdata =
```

- Do you have post_join_delay set in cluster.conf? If so, what is that value?
- Could you collect the output of cman_tool services before and after the gfs mount that fails? Preferably from all the nodes in the cluster.
- The output of fenced -D may also be useful to see.

post_join_delay is set to 180. Where can I find the output of fenced -D? I'll collect the cman_tool services output when I have access to the system again.
A post_join_delay of 180 means that fenced will delay 3 minutes before fencing the 4th non-member node at startup time. Any gfs mount attempts in this window will fail the way you're seeing. If you don't want the gfs mount failures, you'll need to wait for fenced to finish starting before attempting the mounts. I'm guessing that this fully explains what you're seeing. You'd have to redirect the output of fenced -D to a file if you want to save it.

Any update on this one?

Haven't seen this problem in a long time. Feel free to close as invalid.