Bug 207004

Summary: RHEL5 cman allows node to join cluster even though it's being fenced.
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.0
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Reporter: Josef Bacik <jbacik>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, o.martinez, teigland
Doc Type: Bug Fix
Last Closed: 2006-09-19 15:08:05 UTC

Description Josef Bacik 2006-09-18 18:36:07 UTC
Description of problem:
If you set up a two-node cluster, node1 and node2, and start cman on node1 only,
node1 will try to fence node2 once node2 hasn't joined within the post_join_delay
period. If your fence script is broken like mine, fenced will fail to fence the
node and will keep retrying, which is expected. However, if you then start cman
on node2, node1 will allow it to join even though it is still trying to fence
that node. Everything appears to work, and fenced keeps retrying the fence in
the background.
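
For step 2 of the reproducer below, any fence agent that always fails will do.
A minimal sketch (the agent name and path are hypothetical, not the fence_apc
setup from this report):

#!/bin/sh
# /usr/sbin/fence_fail -- hypothetical always-failing fence agent, test use only.
# fenced passes the agent its arguments as key=value pairs on stdin; read and
# discard them, then exit non-zero so fenced keeps retrying.
cat > /dev/null
echo "fence_fail: simulated fence failure" >&2
exit 1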

Version-Release number of selected component (if applicable):
[root@rh3cluster1 ~]# rpm -q cman
cman-2.0.15-3.fc6

How reproducible:
Every time

Steps to Reproduce:
1. Start cman on node1 and wait until it starts trying to fence node2.
2. Make sure your fence script will fail to fence the node (e.g. the dummy agent sketched above).
3. Start cman on node2.
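
A command-level sketch of those steps (host names follow the description; log
locations and init scripts are the standard RHEL5 ones):

[root@node1 ~]# service cman start           # step 1: start cman on node1 only
[root@node1 ~]# tail -f /var/log/messages    # wait for fenced: "not a cluster member after N sec post_join_delay"
# step 2: make sure the fence agent configured in cluster.conf fails (e.g. the dummy agent above)
[root@node2 ~]# service cman start           # step 3: node2 joins while node1 is still retrying the fence
[root@node1 ~]# cman_tool nodes              # node2 now shows up as a member (Sts M)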
  
Actual results:
Node2 joins successfully and everything appears to work.

Expected results:
Node2 shouldn't be allowed to join until the fencing script succeeds.

Additional info:
This is how RHEL4 works, and I assume it is the correct behavior.  I'm going to
try 2.0.16 when it finishes getting built, and if it's no longer a problem I
will close this ticket.

Comment 1 Christine Caulfield 2006-09-19 07:18:13 UTC
Joining the cluster is fine. What should /not/ happen is that an unfenced node
re-joins the fence domain.

That would be a bug in the fencing daemon, I suspect.
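
For reference, a rough way to see the two layers being distinguished here,
using only the commands that appear later in this bug (exact output varies by
version):

[root@node1 ~]# cman_tool nodes      # cman/cluster membership -- an unfenced node is allowed to join here
[root@node1 ~]# cman_tool services   # fence domain membership -- an unfenced node should not re-join the
                                     # "fence 0 default" group until fencing for it has completed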

Comment 2 Josef Bacik 2006-09-19 12:27:51 UTC
I think the fence domain was in JOIN_START_WAIT or something like that; the
thing that worried me was that rgmanager started fine.  I will upgrade
everything this morning, try again, and see if it still happens.

Comment 3 Christine Caulfield 2006-09-19 12:40:15 UTC
JOIN_START_WAIT sounds fine then.
Maybe rgmanager needs to check the fenced flag in cman?


I'm not sure why it should be different though, because RHEL4 cman allowed
unfenced nodes to join the cman layer of the cluster too. fenced and rgmanager
are layered above the cluster manager.
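
As a rough illustration of the kind of check being suggested (hypothetical, not
something rgmanager actually does): a startup wrapper could refuse to bring up
layered services while the fence domain is still settling, e.g.:

#!/bin/sh
# Hypothetical pre-start check: wait until no group reported by cman_tool
# services is stuck in JOIN_START_WAIT before starting layered services.
while cman_tool services | grep -q JOIN_START_WAIT; do
    echo "fence/dlm groups still settling, waiting..." >&2
    sleep 5
done
service rgmanager start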

Comment 4 Josef Bacik 2006-09-19 14:57:15 UTC
Sep 19 10:55:03 rh5cluster1 fenced[5340]: rh5cluster2.gsslab.rdu.redhat.com not
a cluster member after 3 sec post_join_delay
Sep 19 10:55:03 rh5cluster1 fenced[5340]: fencing node
"rh5cluster2.gsslab.rdu.redhat.com"
Sep 19 10:55:08 rh5cluster1 fenced[5340]: agent "fence_apc" reports: failed:
unrecognised menu response
Sep 19 10:55:08 rh5cluster1 fenced[5340]: fence
"rh5cluster2.gsslab.rdu.redhat.com" failed
Sep 19 10:55:13 rh5cluster1 fenced[5340]: fencing node
"rh5cluster2.gsslab.rdu.redhat.com"
Sep 19 10:55:15 rh5cluster1 fenced[5340]: agent "fence_apc" reports: failed:
unrecognised menu response
Sep 19 10:55:15 rh5cluster1 fenced[5340]: fence
"rh5cluster2.gsslab.rdu.redhat.com" failed


[root@rh5cluster2 ~]# cman_tool services
type             level name     id       state       
fence            0     default  00010001 none        
[1 2]

[root@rh5cluster2 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     88   2006-09-19 10:55:18  rh5cluster1.gsslab.rdu.redhat.com
   2   M     84   2006-09-19 10:55:18  rh5cluster2.gsslab.rdu.redhat.com

and now rh5cluster1 is no longer trying to fence rh5cluster2.
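
(For context, the 3 sec post_join_delay and 0 sec post_fail_delay in these logs
come from the fence_daemon settings in cluster.conf; a typical line looks
something like the following, with the values here chosen to match the delays
in the logs above, so check your own cluster.conf:)

[root@rh5cluster1 ~]# grep fence_daemon /etc/cluster/cluster.conf
        <fence_daemon post_join_delay="3" post_fail_delay="0"/>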

Comment 5 Josef Bacik 2006-09-19 15:07:10 UTC
I guess I like this better; that way, if the fencing fails, you don't get a
hang because it's trying to fence a node that just took too long to boot up.
rh5cluster1 has stopped trying to fence the node after the node joined the
cluster, which works for me.

Comment 6 Josef Bacik 2006-09-19 15:20:05 UTC
Well crap, looks like fenced kicked back on when I tried to start clvmd; now
we're back in a messed-up state.

[root@rh5cluster1 ~]# cman_tool services
type             level name     id       state       
fence            0     default  00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd    00020001 JOIN_START_WAIT
[1 2]

though it looks like this is the reason fenced kicked back in:

Sep 19 11:08:52 rh5cluster1 openais[5314]: [TOTEM] Sending initial ORF token
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ] CLM CONFIGURATION CHANGE
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ] New Configuration:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ]      r(0) ip(10.10.1.13)
Sep 19 11:08:52 rh5cluster1 fenced[5340]: rh5cluster2.gsslab.rdu.redhat.com not
a cluster member after 0 sec post_fail_delay
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ] Members Left:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ]      no interface found for
nodeid
Sep 19 11:08:52 rh5cluster1 fenced[5340]: fencing node
"rh5cluster2.gsslab.rdu.redhat.com"
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ] Members Joined:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [SYNC ] This node is within the
primary component and will provide service.
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ] CLM CONFIGURATION CHANGE
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ] New Configuration:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ]      r(0) ip(10.10.1.13)
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ] Members Left:
Sep 19 11:08:52 rh5cluster1 openais[5314]: [CLM  ] Members Joined:

I don't know why rh5cluster2 just randomly dropped out; there were no network
problems, though the time corresponds with when I ran

service clvmd start

on rh5cluster2.  The service clvmd start hung for a while but succeeded, with a
bunch of errors in /var/log/messages:

Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: recover 7
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: add member 1
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: config mismatch: 32,0 nodeid 1: 0,0
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: ping_members aborted -22 last
nodeid 1
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: total members 2 error -22
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: recover_members failed -22
Sep 19 11:18:13 rh5cluster2 kernel: dlm: clvmd: recover 7 error -22
Sep 19 11:18:15 rh5cluster2 kernel: dlm: lockspace 20001 from 1 not found
Sep 19 11:18:47 rh5cluster2 last message repeated 55 times

and everything is still in JOIN_START_WAIT.
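
A quick way to confirm the stuck state from both sides, using only commands
already shown in this bug:

[root@rh5cluster1 ~]# cman_tool services | grep JOIN_START_WAIT   # fence and dlm groups stuck on node1
[root@rh5cluster2 ~]# cman_tool services                          # compare the group states on node2
[root@rh5cluster1 ~]# grep fenced /var/log/messages | tail -5     # is fenced still retrying the fence?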

Comment 7 Christine Caulfield 2006-09-19 15:39:48 UTC
But isn't that what you wanted? Everything waiting for the node to be fenced?

Comment 8 Josef Bacik 2006-09-19 15:44:03 UTC
Yes, but that's not what's happening.

Node1 starts
Node1 tries to fence node2 because it's not up
Node2 joins
Node1 stops trying to fence node2
Node2 starts fenced, it succeeds
Node1 starts clvmd, it succeeds
Node2 starts clvmd, it hangs
Node1 thinks that node2 has gone out to lunch and tries to fence it again
Node2's clvmd eventually succeeds
All services end up in JOIN_START_WAIT

node1 shouldn't stop trying to fence node2 until it's successful, IMHO, but
that's what is happening.

Comment 9 Josef Bacik 2006-09-19 15:51:27 UTC
OK, I'm going to close this.  The other bugs I hit aren't related to this one,
and this is working properly: allowing the node to join and stopping the
fencing action is the correct behavior.

Comment 10 Nate Straz 2007-12-13 17:22:20 UTC
Moving all RHCS version 5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.