Bug 207004
Summary: | RHEL5 cman allows node to join cluster even though it's being fenced. | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Josef Bacik <jbacik>
Component: | cman | Assignee: | Christine Caulfield <ccaulfie>
Status: | CLOSED NOTABUG | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 5.0 | CC: | cluster-maint, o.martinez, teigland
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-09-19 15:08:05 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Josef Bacik
2006-09-18 18:36:07 UTC
Joining the cluster is fine. What should /not/ happen is that an unfenced node re-joins the fence domain. That would be a bug in the fencing daemon, I suspect.

I think the fence domain was in JOIN_START_WAIT or something like that; the thing that worried me was that rgmanager started fine. I will upgrade everything this morning, try again, and see if it still happens.

JOIN_START_WAIT sounds fine then. Maybe rgmanager needs to check the fenced flag in cman? I'm not sure why it should be different, though, because RHEL4 cman allowed unfenced nodes to join the cman layer of the cluster too. fence and rgmanager are layered above the cluster manager.

```
Sep 19 10:55:03 rh5cluster1 fenced[5340]: rh5cluster2.gsslab.rdu.redhat.com not a cluster member after 3 sec post_join_delay
Sep 19 10:55:03 rh5cluster1 fenced[5340]: fencing node "rh5cluster2.gsslab.rdu.redhat.com"
Sep 19 10:55:08 rh5cluster1 fenced[5340]: agent "fence_apc" reports: failed: unrecognised menu response
Sep 19 10:55:08 rh5cluster1 fenced[5340]: fence "rh5cluster2.gsslab.rdu.redhat.com" failed
Sep 19 10:55:13 rh5cluster1 fenced[5340]: fencing node "rh5cluster2.gsslab.rdu.redhat.com"
Sep 19 10:55:15 rh5cluster1 fenced[5340]: agent "fence_apc" reports: failed: unrecognised menu response
Sep 19 10:55:15 rh5cluster1 fenced[5340]: fence "rh5cluster2.gsslab.rdu.redhat.com" failed
```

```
[root@rh5cluster2 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010001 none
[1 2]

[root@rh5cluster2 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     88   2006-09-19 10:55:18  rh5cluster1.gsslab.rdu.redhat.com
   2   M     84   2006-09-19 10:55:18  rh5cluster2.gsslab.rdu.redhat.com
```

...and now rh5cluster1 is no longer trying to fence rh5cluster2. I guess I like this better: if the fencing fails, you don't end up hanging because the cluster keeps trying to fence a node that simply took too long to boot. rh5cluster1 stopped trying to fence the node after the node joined the cluster, which works for me.
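The `cman_tool nodes` transcript above is what shows which nodes cman currently considers members (status `M`). As a minimal sketch, not part of cman itself and with the column layout assumed from this bug's transcript (real output may differ across versions), a script could extract the member names like this:

```python
# Minimal sketch: parse `cman_tool nodes` output (format taken from the
# transcript above) and return the names of nodes with member status "M".
def member_nodes(output: str) -> list[str]:
    members = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Expected fields: Node, Sts, Inc, Joined date, Joined time, Name
        if len(fields) >= 6 and fields[1] == "M":
            members.append(fields[-1])  # node name is the last field
    return members

sample = """\
Node  Sts   Inc   Joined               Name
   1   M     88   2006-09-19 10:55:18  rh5cluster1.gsslab.rdu.redhat.com
   2   M     84   2006-09-19 10:55:18  rh5cluster2.gsslab.rdu.redhat.com
"""
print(member_nodes(sample))
```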
Well, crap. It looks like fenced kicked back in when I tried to start clvmd; now we're back in a messed-up state:

```
[root@rh5cluster1 ~]# cman_tool services
type             level name     id       state
fence            0     default  00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd    00020001 JOIN_START_WAIT
[1 2]
```

Though it looks like this is the reason fenced kicked back in:

```
Sep 19 11:08:52 rh5cluster1 openais[5314]: [TOTEM] Sending initial ORF token
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] CLM CONFIGURATION CHANGE
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] New Configuration:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] r(0) ip(10.10.1.13)
Sep 19 11:08:52 rh5cluster1 fenced[5340]: rh5cluster2.gsslab.rdu.redhat.com not a cluster member after 0 sec post_fail_delay
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] Members Left:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] no interface found for nodeid
Sep 19 11:08:52 rh5cluster1 fenced[5340]: fencing node "rh5cluster2.gsslab.rdu.redhat.com"
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] Members Joined:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [SYNC ] This node is within the primary component and will provide service.
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] CLM CONFIGURATION CHANGE
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] New Configuration:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] r(0) ip(10.10.1.13)
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] Members Left:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM ] Members Joined:
```

I don't know why rh5cluster2 just randomly dropped out; there were no network problems. That time does correspond with when I ran `service clvmd start` on rh5cluster2, though.
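The `post_join_delay` and `post_fail_delay` values fenced reports in the logs above ("3 sec post_join_delay", "0 sec post_fail_delay") are configured on the `<fence_daemon>` element in /etc/cluster/cluster.conf. A sketch of the fragment implied by these logs (the cluster name and config_version here are hypothetical; this is not a complete configuration):

```xml
<!-- Fragment of /etc/cluster/cluster.conf. post_join_delay is how long
     fenced waits for a node to join before fencing it after fenced joins
     the domain; post_fail_delay is the wait after a node fails. The
     values below match the delays reported in this bug's logs. -->
<cluster name="rh5cluster" config_version="1">
  <fence_daemon post_join_delay="3" post_fail_delay="0"/>
</cluster>
```

Raising `post_join_delay` is the usual knob when nodes that merely boot slowly keep getting fenced at startup.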
The `service clvmd start` hung for a while but succeeded, with a bunch of errors in /var/log/messages:

```
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: recover 7
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: add member 1
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: config mismatch: 32,0 nodeid 1: 0,0
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: ping_members aborted -22 last nodeid 1
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: total members 2 error -22
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: recover_members failed -22
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: recover 7 error -22
Sep 19 11:18:15 rh5cluster2 kernel: dlm: lockspace 20001 from 1 not found
Sep 19 11:18:47 rh5cluster2 last message repeated 55 times
```

...and everything is still in JOIN_START_WAIT.

But isn't that what you wanted? Everything waiting for the node to be fenced?

Yes, but that's not what's happening:

1. Node1 starts.
2. Node1 tries to fence node2 because it's not up.
3. Node2 joins.
4. Node1 stops trying to fence node2.
5. Node2 starts fenced; it succeeds.
6. Node1 starts clvmd; it succeeds.
7. Node2 starts clvmd; it hangs.
8. Node1 thinks that node2 has gone out to lunch and tries to fence it again.
9. clvmd on node2 eventually succeeds.
10. All services are left in JOIN_START_WAIT.

Node1 shouldn't stop trying to fence node2 until the fence is successful, IMHO, but that's what is happening.

OK, I'm going to close this. Every other bug I hit isn't related to this, and this part is working properly: allowing the node to join, which stops the fencing action, is the correct behavior.

Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.
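A healthy service group shows state `none` in `cman_tool services`, while the wedged state described in this bug shows `JOIN_START_WAIT`. As a minimal sketch (the column layout is assumed from this bug's transcript, not from any cman API), detecting groups stuck in a transitional state could look like this:

```python
# Minimal sketch: scan `cman_tool services` output (format taken from the
# transcript above) for service groups whose state is not "none".
def stuck_groups(output: str) -> list[tuple[str, str]]:
    stuck = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        if line.lstrip().startswith("["):         # skip member-list lines like "[1 2]"
            continue
        fields = line.split()                     # type, level, name, id, state
        if len(fields) == 5 and fields[4] != "none":
            stuck.append((fields[0], fields[2]))  # (group type, group name)
    return stuck

sample = """\
type             level name     id       state
fence            0     default  00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd    00020001 JOIN_START_WAIT
[1 2]
"""
print(stuck_groups(sample))
```

A startup script could poll this until the list is empty before moving on to start clvmd, rather than racing the fence domain as happened here.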