Bug 133376 - cluster view confused after nodes leave and rejoin.
Status: CLOSED NEXTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Platform: i686 Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield (GFS Bugs)
Reported: 2004-09-23 12:24 EDT by Dean Jansa
Modified: 2010-01-11 21:58 EST
Last Closed: 2005-04-13 12:11:39 EDT
Doc Type: Bug Fix
Description Dean Jansa 2004-09-23 12:24:08 EDT
6 node cluster, ccsd running and all cman_tool join'ed. 
 
Running worldview, which plays with cluster membership and verifies 
all nodes agree on the view after each change...  I managed to get 
the cluster in a confused state. 
 
The steps (starting from a 6 node quorate cluster): 
 
Take 2 nodes out (in this case morph-06 and morph-03). 
Wait for all other nodes to get a cluster message. 
Check that all nodes see the same cluster view (they did). 
Rejoin morph-06 and morph-03, one at a time, starting with morph-06. 
Wait for cluster message (never comes). 
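The "wait for a cluster message" step above can be sketched as a poll loop over a /proc/cluster/nodes-style file. The helper names, the five-column layout, and the use of a scratch file in place of the real proc file are all assumptions for illustration:

```shell
# Hypothetical sketch: poll a /proc/cluster/nodes-style file until the set
# of member ("M") nodes differs from a baseline, or give up after N tries.
members() { awk '$4 == "M" { print $5 }' "$1" | sort | xargs; }

wait_for_view_change() {
    file=$1 baseline=$2 tries=${3:-10}
    while [ "$tries" -gt 0 ]; do
        [ "$(members "$file")" != "$baseline" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1   # the hang seen above: the cluster message never arrives
}

# Demo with a scratch file standing in for /proc/cluster/nodes:
f=$(mktemp)
printf '   1 1 6 M morph-01\n   2 1 6 M morph-02\n' > "$f"
base=$(members "$f")
printf '   1 1 6 M morph-01\n   2 1 6 X morph-02\n' > "$f"
wait_for_view_change "$f" "$base" && echo "view changed"
rm -f "$f"
```

On the real cluster the baseline would be captured before the leave/join and the poll pointed at /proc/cluster/nodes itself.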
 
Go see why things are hanging...  And find that morph-06 had done 
a cman_tool join, which returned 0.  Running cman_tool join 
again shows that it thinks "Node is already active." 
All other nodes in cluster see morph-06 and morph-03 as expired. 
 
morph-06 sees no other nodes, yet thinks it is active. 
 
Can reproduce every few iterations. 
 
 
[root@morph-01 bin]# cat /proc/cluster/nodes 
Node  Votes Exp Sts  Name 
   1    1    6   M   morph-01 
   2    1    6   X   morph-06 
   3    1    6   X   morph-03 
   4    1    6   M   morph-04 
   5    1    6   M   morph-05 
   6    1    6   M   morph-02 
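In this output the Sts column shows M for member and X for expired. A quick way to pull out the expired nodes, assuming the five-column layout above (the sample is inlined here; on a live node you would read /proc/cluster/nodes directly):

```shell
# List nodes marked expired (Sts == "X") in /proc/cluster/nodes output.
nodes_output='Node  Votes Exp Sts  Name
   1    1    6   M   morph-01
   2    1    6   X   morph-06
   3    1    6   X   morph-03
   4    1    6   M   morph-04
   5    1    6   M   morph-05
   6    1    6   M   morph-02'

printf '%s\n' "$nodes_output" | awk '$4 == "X" { print $5 }'
```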
 
 
On morph-06: 
 
[root@morph-06 bin]# cman_tool join 
Node is already active 
 
[root@morph-06 bin]# cat /proc/cluster/nodes 
Node  Votes Exp Sts  Name 
 
 
Version-Release number of selected component (if applicable): 
 
module:  CMAN <CVS> (built Sep 23 2004 10:12:29) installed  
 
[root@morph-02 root]# cman_tool -V 
cman_tool DEVEL.1095952309 (built Sep 23 2004 10:13:00) 
Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
 
How reproducible: 
Sometimes
Comment 1 Christine Caulfield 2004-09-27 10:26:51 EDT
I suspect it's in a JOIN state. When cman_tool says it's active that
does not mean it's in the cluster, it just means that you can't do
another join because the other one is still active.

Can you get the output of /proc/cluster/status on the joining node(s)
and maybe /proc/cluster/nodes on the node that sees the membership
request if it arrives - that will be the node that says "rejoining".
In fact I'd be interested to know if any other nodes see that request
at all, it could be that the join requests are not arriving. Have a
look in /proc/cluster/status of some nodes that /are/ in the cluster
and see if they have gone into transition.

I'll try to reproduce it here, but given the number of times I've
successfully taken 2 nodes down and rebooted them, I don't hold out
much hope!

I'll see if I can beef up the output of /proc/cluster/status for nodes
that are not in the cluster. That could be very useful.
Comment 2 Dean Jansa 2004-09-27 17:12:20 EDT
I can get into this state with 3 nodes and a simple cman_tool leave 
after today's build: 
 
cman_tool join on all 3 nodes: 
 
On one node:  cman_tool leave 
 
The other nodes do not get the cluster messages until one node 
notices the missing HELLO. 
 
At which point you see one node has the node marked dead with a 
normal_shutdown and the other node will have it marked dead with a 
dead reason.  Now a rejoin, which will work. 
 
Then run: cman_tool leave && cman_tool join 
 
One node gets: 
CMAN: Rejecting cluster membership application from tank-06 - 
already have a node with that name 
 
CMAN: no HELLO from tank-06, removing from the cluster 
 
This node sees the message and has this nodeinfo: 
tank-06 (1): state - dead, leave_reason - dead 
 
The other node gets the cluster message and has this nodeinfo: 
tank-06 (1): state - dead, leave_reason - normal_shutdown 
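A minimal sketch of checking that the two nodes disagree on the recorded leave_reason, with the nodeinfo lines inlined as shell strings (a real check would collect them from each node):

```shell
# Compare the leave_reason recorded for tank-06 on each node.
node_a='tank-06 (1): state - dead, leave_reason - dead'
node_b='tank-06 (1): state - dead, leave_reason - normal_shutdown'

# Strip everything up to and including "leave_reason - ".
reason_a=${node_a##*leave_reason - }
reason_b=${node_b##*leave_reason - }
[ "$reason_a" = "$reason_b" ] || \
    echo "leave_reason mismatch: $reason_a vs $reason_b"
```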
 
And you are left with the following status: 
 
[root@tank-01 cman_sanity]# cat /proc/cluster/status 
Version: 2.0.1 
Config version: 3 
Cluster name: tank-cluster 
Cluster ID: 46516 
Membership state: Cluster-Member 
Nodes: 2 
Expected_votes: 3 
Total_votes: 2 
Quorum: 2 
Active subsystems: 0 
Node addresses: 192.168.44.91 
 
 
[root@tank-05 cman_sanity]# cat /proc/cluster/status 
Version: 2.0.1 
Config version: 3 
Cluster name: tank-cluster 
Cluster ID: 46516 
Membership state: Cluster-Member 
Nodes: 2 
Expected_votes: 3 
Total_votes: 2 
Quorum: 2 
Active subsystems: 0 
Node addresses: 192.168.44.95 
 
[root@tank-06 cman_sanity]# cat /proc/cluster/status 
Version: 2.0.1 
Config version: 3 
Cluster name: tank-cluster 
Cluster ID: 46516 
Membership state: Joining 
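Given one saved copy of /proc/cluster/status per node, the odd one out can be flagged mechanically. A sketch, with the per-node file names and single-line contents assumed for illustration:

```shell
# Report any node whose "Membership state" is not Cluster-Member
# (here, tank-06 stuck in "Joining").
dir=$(mktemp -d)
printf 'Membership state: Cluster-Member\n' > "$dir/tank-01"
printf 'Membership state: Cluster-Member\n' > "$dir/tank-05"
printf 'Membership state: Joining\n'        > "$dir/tank-06"

for f in "$dir"/*; do
    state=$(sed -n 's/^Membership state: //p' "$f")
    [ "$state" = "Cluster-Member" ] || echo "$(basename "$f"): $state"
done
rm -rf "$dir"
```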
 
 
 
root@tank-01 cman_sanity]# cman_tool -V 
cman_tool DEVEL.1096318402 (built Sep 27 2004 15:54:29) 
Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
 
CMAN <CVS> (built Sep 27 2004 15:53:25) installed 
 
 
Comment 3 Christine Caulfield 2004-09-28 04:05:02 EDT
Aha, that makes sense. It seems like a lot of network packets are 
getting lost. cman should cope with this, of course (but can we get
our network checked?)

This checkin should make it more resilient to dodgy networking.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.22; previous revision: 1.21
done
Comment 4 Dean Jansa 2004-10-11 15:36:08 EDT
I've been trying to reproduce this, but have hit the following twice 
now: 
 
A node leaves the cluster, all other nodes get a cluster event and 
agree on the cluster membership.  The node which just left now 
rejoins the cluster.  While this node is in Join-Wait the other 
nodes seem to miss the leave event and output: 
 
CMAN: node tank-05 is not responding - removing from the cluster 
 
Then the node which was trying to rejoin outputs: 
 
CMAN: Error registering barrier: -107 
CMAN: too many transition restarts - will die 
 
Right after the barrier error the node's state is: 
[root@tank-05 root]# cat /proc/cluster/status  
Version: 2.0.1 
Config version: 3 
Cluster name: tank-cluster 
Cluster ID: 46516 
Membership state: Transition-Master 
 
Ideas as to why the other nodes want to remove the joining node from 
the cluster?  Seems like a message is being lost somewhere? 
 
Comment 5 Dean Jansa 2004-10-11 16:14:15 EDT
The above was hit while running the following versions: 
 
[root@tank-01 root]# cman_tool -V 
cman_tool DEVEL.1097504146 (built Oct 11 2004 09:16:55) 
Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
 
CMAN <CVS> (built Oct 11 2004 09:15:50) installed 
 
Comment 6 Kiersten (Kerri) Anderson 2004-11-16 14:08:50 EST
Updating version to the right level in the defects.  Sorry for the storm.
Comment 7 Dean Jansa 2005-04-13 12:11:39 EDT
This test no longer sees the original issue, nor the issue in comment #4.
