Bug 133512

Summary: node gets different view of cluster and leaves
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Hardware: i686
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: GFS Bugs <gfs-bugs>
CC: amanthei, ccaulfie, djansa
Doc Type: Bug Fix
Last Closed: 2005-06-13 19:43:24 UTC
Bug Blocks: 142853, 144795
Attachments (description, flags):
logs from test run, none
logs of test run, none

Description Corey Marthaler 2004-09-24 16:01:09 UTC
Description of problem:
This happened right after a reboot of all the nodes.

all the cluster.conf files are the same
start ccsd on all nodes
cman_join on all
and then suddenly morph-05 decides he wants to leave:

morph-05:
Sep 23 16:15:52 morph-05 kernel: CMAN: Waiting to join or form a Linux-cluster
Sep 23 16:16:10 morph-05 kernel: CMAN: sending membership request
Sep 23 16:16:10 morph-05 kernel: CMAN: got node morph-01
Sep 23 16:16:15 morph-05 kernel: CMAN: got node morph-03
Sep 23 16:16:15 morph-05 kernel: CMAN: got node morph-06
Sep 23 16:16:20 morph-05 kernel: CMAN: got node morph-04
Sep 23 16:16:21 morph-05 kernel: CMAN: we are leaving the cluster

morph-06:
Sep 23 16:16:58 morph-06 kernel: CMAN: Waiting to join or form a Linux-cluster
Sep 23 16:17:15 morph-06 kernel: CMAN: sending membership request
Sep 23 16:17:23 morph-06 kernel: CMAN: got node morph-05
Sep 23 16:17:23 morph-06 kernel: CMAN: got node morph-01
Sep 23 16:17:25 morph-06 kernel: CMAN: got node morph-04
Sep 23 16:17:26 morph-06 kernel: CMAN: Node morph-05 is leaving the cluster, reason 5
Sep 23 16:17:48 morph-06 kernel: CMAN: got node morph-02
Sep 23 16:17:49 morph-06 kernel: CMAN: quorum regained, resuming activity
Sep 23 16:17:53 morph-06 kernel: CMAN: got node morph-03

from code:
REASON 5 = /* Our view of the cluster is in a minority */

[root@morph-01 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    6   M   morph-01
   2    1    6   X   morph-05
   3    1    6   M   morph-06
   4    1    6   M   morph-04
   5    1    6   M   morph-02
   6    1    6   M   morph-03

After seeing that my fence_tool join failed, I realized that morph-05
had left. I tried to join once again and this time it worked.

I'll try to reproduce this so that we can see how it got a different
view of the cluster.
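
For anyone scripting the check above, here is a minimal sketch that lists the nodes whose Sts column is not M. The function name non_members is made up for illustration, and the five-column layout is assumed to match the /proc/cluster/nodes output shown in this report.

```shell
#!/bin/bash
# Hypothetical helper: print the names of nodes whose status is not "M".
# Assumes the five-column layout shown in this report:
#   Node Votes Exp Sts Name
non_members() {
    # NR > 1 skips the header row; column 4 is Sts, column 5 is Name.
    awk 'NR > 1 && NF >= 5 && $4 != "M" { print $5 }'
}

# Sample input copied from the report. On a live node one would run
#   non_members < /proc/cluster/nodes
sample='Node  Votes Exp Sts  Name
   1    1    6   M   morph-01
   2    1    6   X   morph-05
   3    1    6   M   morph-06'

printf '%s\n' "$sample" | non_members
```

Running this on every node and comparing the results is one quick way to spot a node that the rest of the cluster has dropped.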

Comment 1 Christine Caulfield 2004-09-27 07:51:29 UTC
If you can reproduce it can you also include the contents of
/proc/cluster/status please ?

Comment 2 Corey Marthaler 2004-10-04 20:44:06 UTC
I reproduced a similar scenario to this bug.
I did a cman_tool join on all nodes, and morph-04 never joined, but
instead of the "we are leaving the cluster" message, it gave
"kernel: CMAN: Been in JOINWAIT for too long - giving up" even though
every other node had joined just fine.

Here is what /proc/cluster/status showed on morph-04:

[root@morph-04 root]# cat /proc/cluster/status
Version: 2.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster

And here is one of the live nodes:
[root@morph-03 root]# cat /proc/cluster/status
Version: 2.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Cluster-Member
Nodes: 4
Expected_votes: 5
Total_votes: 4
Quorum: 3
Active subsystems: 0
Node addresses: 192.168.44.63

I then attempted another cman_tool join and morph-04 finally joined
and all was happy.
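
The status output above can be sanity-checked with a short script. This is a sketch only: the helper names are hypothetical, and the simple-majority rule is an assumption that happens to match the numbers shown (Expected_votes: 5 gives Quorum: 3), not something taken from the CMAN source.

```shell
#!/bin/bash
# Hypothetical helper: simple-majority quorum for a given expected vote count.
# Assumption: matches the status output in this report (5 expected -> 3).
majority_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Hypothetical helper: read status text on stdin and succeed only when
# a Quorum line is present and Total_votes >= Quorum.
is_quorate() {
    awk -F': *' '
        $1 == "Total_votes" { total = $2 }
        $1 == "Quorum"      { quorum = $2 }
        END { exit !(quorum + 0 > 0 && total + 0 >= quorum + 0) }
    '
}

# Sample copied from the morph-03 output in this comment.
sample='Membership state: Cluster-Member
Nodes: 4
Expected_votes: 5
Total_votes: 4
Quorum: 3'

majority_quorum 5
if printf '%s\n' "$sample" | is_quorate; then
    echo "quorate"
else
    echo "not quorate"
fi
```

A node in the Not-in-Cluster state prints no Quorum line at all, so is_quorate reports failure for it.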

Comment 3 Christine Caulfield 2004-10-05 12:35:35 UTC
I doubt this is related, but I have spotted (and fixed) a case where a
node could give up too early in the join if its request messages get
lost on the network. But this can still happen if the cluster is in
transition for a long time while the node is wanting to join and some
messages go missing. It might be worth increasing the timeout in some
circumstances, but I'm not totally convinced.
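
Until the timeout question is settled, the manual workaround from comment 2 (run the join again) can be scripted. The following is a minimal sketch, not a fix: retry and flaky_join are hypothetical names, and the attempt count and delay in the commented example are arbitrary.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and stop at the first success.
retry() {
    local attempts=$1 delay=$2 n=1
    shift 2
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Stand-in for a join that fails twice and then succeeds, so the sketch
# can be run without a cluster. On a real node the call might look like
#   retry 3 10 cman_tool join
tries=0
flaky_join() {
    tries=$((tries + 1))
    [ "$tries" -ge 3 ]
}

retry 5 0 flaky_join && echo "joined after $tries tries"
```

Retrying only hides the symptom, so any retry that was needed is still worth logging against this bug.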

Comment 4 Corey Marthaler 2004-11-09 17:55:29 UTC
I reproduced this again:

Nov  9 12:48:39 morph-05 kernel: CMAN: Waiting to join or form a Linux-cluster
Nov  9 12:48:58 morph-05 kernel: CMAN: sending membership request
Nov  9 12:48:58 morph-05 kernel: CMAN: got node morph-01
Nov  9 12:49:03 morph-05 kernel: CMAN: got node morph-02
Nov  9 12:49:03 morph-05 kernel: CMAN: got node morph-03
CMAN: quorum regained, resuming activity
Nov  9 12:49:06 morph-05 kernel: CMAN: quorum regained, resuming activity
Nov  9 12:49:08 morph-05 kernel: CMAN: got node morph-04

Nov  9 12:49:09 morph-05 kernel: CMAN: we are leaving the cluster. Reason is 5
Nov  9 12:49:09 morph-05 kernel:


[root@morph-05 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name

[root@morph-05 root]# cat /proc/cluster/status
Version: 3.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster


status from a node in the cluster:

[root@morph-04 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    5   M   morph-01
   2    1    5   X   morph-05
   3    1    5   M   morph-03
   4    1    5   M   morph-04
   5    1    5   M   morph-02


[root@morph-04 root]# cat /proc/cluster/status
Version: 3.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Cluster-Member
Nodes: 4
Expected_votes: 5
Total_votes: 4
Quorum: 3
Active subsystems: 0
Node addresses: 192.168.44.64



Comment 5 Christine Caulfield 2004-11-10 11:51:49 UTC
Got it!

The node had a pending join, so when it came to compare node states
the master thought that node was dead and the local one thought it was
joining.

As far as this comparison goes, "joining" is the same as "dead".

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.30; previous revision: 1.29
done
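
The idea behind this fix can be illustrated outside the kernel. The sketch below is not the membership.c code: the helper names, the "name=state" encoding, and the node names are all invented for illustration. It only shows the rule stated above, that "joining" and "dead" are treated as equal when two views are compared.

```shell
#!/bin/bash
# Hypothetical helper: collapse states that should compare as equal.
# Per this comment, "joining" counts the same as "dead" here.
normalize_state() {
    case "$1" in
        joining|dead) echo dead ;;
        *)            echo "$1" ;;
    esac
}

# Hypothetical helper: compare two views given as space-separated
# "name=state" words in the same node order. Returns 0 when they agree.
views_agree() {
    local -a master=($1) ours=($2)
    local i
    [ "${#master[@]}" -eq "${#ours[@]}" ] || return 1
    for i in "${!master[@]}"; do
        # node names must match position by position
        [ "${master[i]%%=*}" = "${ours[i]%%=*}" ] || return 1
        # states must match after normalization
        [ "$(normalize_state "${master[i]#*=}")" = \
          "$(normalize_state "${ours[i]#*=}")" ] || return 1
    done
    return 0
}

# Made-up views: the master sees node-b as dead, the local node as joining.
master_view="node-a=member node-b=dead node-c=member"
local_view="node-a=member node-b=joining node-c=member"

if views_agree "$master_view" "$local_view"; then
    echo "views agree"
else
    echo "views differ"
fi
```

Without the normalization step the two views above would differ, which is the situation that made the joining node decide it was in a minority.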


Comment 6 Corey Marthaler 2004-11-17 19:58:08 UTC
both morph-04 and morph-05 appeared to hit this today after morph-01
and morph-02 rejoined after having been shot:

Nov 17 13:47:31 morph-04 kernel: CMAN: we are leaving the cluster. Reason is 5
Nov 17 13:47:31 morph-04 kernel:
Nov 17 13:47:31 morph-04 kernel: dlm: closing connection to node 2
Nov 17 13:47:31 morph-04 kernel: dlm: closing connection to node 3
Nov 17 13:47:31 morph-04 kernel: dlm: closing connection to node 4
Nov 17 13:47:31 morph-04 kernel: SM: 00000001 sm_stop: SG still joined
Nov 17 13:47:31 morph-04 kernel: SM: 01000003 sm_stop: SG still joined
Nov 17 13:47:31 morph-04 kernel: SM: 02000005 sm_stop: SG still joined
Nov 17 13:47:35 morph-04 kernel: dlm: dlm_unlock: lkid 30389 lockspace not found


Nov 17 13:48:03 morph-05 kernel: CMAN: we are leaving the cluster. Reason is 5
Nov 17 13:48:03 morph-05 kernel:
Nov 17 13:48:03 morph-05 kernel: dlm: closing connection to node 2
Nov 17 13:48:03 morph-05 kernel: dlm: closing connection to node 3
Nov 17 13:48:03 morph-05 kernel: dlm: closing connection to node 4
Nov 17 13:48:03 morph-05 kernel: dlm: process_cluster_request invalid lockspace 1000004 from 4 req 9
Nov 17 13:48:03 morph-05 kernel: SM: 00000001 sm_stop: SG still joined
Nov 17 13:48:03 morph-05 kernel: 32 "      11         88b0a0f"
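
To find every occurrence across collected logs, a small filter is enough. This is a sketch under two assumptions: the message is on a single line in the real syslog (it is only wrapped in this report), and the standard syslog field order is used. The name reason5_leaves is hypothetical. The first report in this bug predates the "Reason is" suffix, so that older format is not matched.

```shell
#!/bin/bash
# Hypothetical helper: read syslog lines on stdin and print
# "<host> <time>" for every CMAN self-removal with reason 5.
reason5_leaves() {
    # syslog fields: $1 month, $2 day, $3 time, $4 host
    awk '/CMAN: we are leaving the cluster\. Reason is 5/ { print $4, $3 }'
}

# Sample lines taken from this comment, joined back onto one line each.
sample='Nov 17 13:47:31 morph-04 kernel: CMAN: we are leaving the cluster. Reason is 5
Nov 17 13:47:31 morph-04 kernel: dlm: closing connection to node 2
Nov 17 13:48:03 morph-05 kernel: CMAN: we are leaving the cluster. Reason is 5'

printf '%s\n' "$sample" | reason5_leaves
```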


Comment 7 Christine Caulfield 2004-11-19 08:55:47 UTC
I've added an assert to the code. If you see a "BUG() at line 244 of
membership.c" can you post it here along with the last few messages of
the rest of the nodes please.

Comment 8 Christine Caulfield 2004-11-19 14:55:20 UTC
If a node died in joinconf then it only got marked dead on the local
node. That could possibly cause this bug. Feel free to punt this back
if you see it again :)

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.35; previous revision: 1.34
done


Comment 9 Adam "mantis" Manthei 2004-12-14 22:04:58 UTC
I keep tripping the assert (see bug #142853). On my 8 node setup, I see the
BUG() every once in a while.  However, I see this bug almost every time when I
run "cman_tool join" on all the nodes at the same time.  

Changing status to ASSIGNED as this is still not fixed.

Comment 10 Adam "mantis" Manthei 2004-12-14 22:41:00 UTC
Created attachment 108576 [details]
logs from test run

demonstration of BUG assertion: 
"kernel BUG at line membership.c:244!"

node trin-07 is the node that bombed out

Comment 11 Christine Caulfield 2004-12-22 11:49:14 UTC
*** Bug 142853 has been marked as a duplicate of this bug. ***

Comment 12 Christine Caulfield 2004-12-22 11:50:40 UTC
OK, it passes my tests now, let's see how it fares on yours!

Comment 13 Adam "mantis" Manthei 2004-12-22 19:18:25 UTC
Things seem to be a little better.  I'm not seeing the BUG() anymore,
but it seems that the nodes still aren't able to join the cluster
100% of the time.  This might be another bug though... need to
investigate that further.  In the meantime, this is the test I've been
using:

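#!/bin/bash
# Repeatedly start and stop cman on every node until a step fails.
# "broadcast" is a local helper; it appears to run the quoted command
# on each listed host and fail if any host fails.
i=0
while echo "$(date): iteration $i"
do
        action="starting"
        echo "$action cman"
        broadcast root@trin-0{1,2,3,4,6,7,8,9} -- "/etc/init.d/cman start" || break
        # every node must list itself in the membership table
        broadcast root@trin-0{1,2,3,4,6,7,8,9} -- "grep \$(hostname -s) /proc/cluster/nodes" || break

        action="stopping"
        echo "$action cman"
        broadcast root@trin-0{1,2,3,4,6,7,8,9} -- "/etc/init.d/cman stop" || break
        # /proc/cluster is a directory, so test with -e rather than -f
        broadcast root@trin-0{1,2,3,4,6,7,8,9} -- "! [ -e /proc/cluster ]" || break

        i=$((i+1))
done

echo "error detected $action after $i iterations"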


Comment 14 Adam "mantis" Manthei 2004-12-22 19:24:28 UTC
moving back to ASSIGNED since cman is still exhibiting the same
behavior as originally described

Comment 15 Adam "mantis" Manthei 2004-12-22 19:48:18 UTC
The script from comment #13 caused one of my 8 nodes (trin-03) to fail
to start cman properly.  While trying to figure out what was going on, I
cat'ed /proc/cluster/status and caused my kernel to oops:

[root@trin-03 ~]# cat /proc/cluster/nodes 
Node  Votes Exp Sts  Name
   1    1    9   X   trin-01
   3    1    9   X   trin-04
   4    1    9   X   trin-06
   5    1    9   X   trin-08
   6    1    9   X   trin-02
   7    1    9   X   trin-07
   8    1    9   X   trin-09

[root@trin-03 ~]# cat /proc/cluster/status <-- caused oops
Segmentation fault

[root@trin-03 ~]# dmesg
CMAN <CVS> (built Dec 22 2004 10:54:49) installed
NET: Registered protocol family 30
CMAN: Waiting to join or form a Linux-cluster
CMAN: sending membership request
CMAN: sending membership request
CMAN: got node trin-01
CMAN: got node trin-04
CMAN: got node trin-06
CMAN: got node trin-08
CMAN: quorum regained, resuming activity
CMAN: got node trin-07
CMAN: got node trin-02
CMAN: nmembers in HELLO message from 3 does not match our view (got 5, exp 6)
CMAN: node trin-01 is not responding - removing from the cluster
CMAN: got node trin-07
CMAN: got node trin-09
CMAN: nmembers in HELLO message from 6 does not match our view (got 6, exp 7)
CMAN: node trin-01 rejoining
CMAN: node trin-08 is not responding - removing from the cluster
CMAN: node trin-01 is not responding - removing from the cluster
CMAN: node trin-01 rejoining
CMAN: we are leaving the cluster. Reason is 5

CMAN: Waiting to join or form a Linux-cluster
CMAN: sending membership request
CMAN: sending membership request
CMAN: sending membership request
CMAN: sending membership request
CMAN: got node trin-09
CMAN: got node trin-07
CMAN: got node trin-02
CMAN: got node trin-08
CMAN: got node trin-06
CMAN: got node trin-01
CMAN: got node trin-04
Got ENDTRANS from a node not the master: master: 950150984, sender: 3
CMAN: node trin-04 is not responding - removing from the cluster
CMAN: node trin-06 is not responding - removing from the cluster
CMAN: node trin-08 is not responding - removing from the cluster
CMAN: node trin-02 is not responding - removing from the cluster
CMAN: node trin-07 is not responding - removing from the cluster
CMAN: node trin-09 is not responding - removing from the cluster
Unable to handle kernel paging request at virtual address 6c636e69
 printing eip:
f89f8444
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: cman(U) parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc md5 ipv6 dm_mod button battery ac uhci_hcd hw_random e1000 floppy ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<f89f8444>]    Tainted: GF     VLI
EFLAGS: 00010282   (2.6.9-1.906_EL) 
EIP is at proc_cluster_status+0x29c/0x2d0 [cman]
eax: 0000000f   ebx: f8a001ca   ecx: 6c636e69   edx: f8a00201
esi: 00000000   edi: ebcae071   ebp: 000000ec   esp: f5b0ef2c
ds: 007b   es: 007b   ss: 0068
Process cat (pid: 15827, threadinfo=f5b0e000 task=ebcc00b0)
Stack: ebcae0dd 00000001 00000000 ebcc00b0 00000000 6c636e69 ebcae000 f6721c00
       f5e31c80 ebcae000 00000400 c019f585 00000400 f6721c00 00000000 00000400
       08b49858 00000000 00000000 c0352920 f5e31c80 00000400 f5b0efac c01621fe
Call Trace:
 [<c019f585>] proc_file_read+0x97/0x225
 [<c01621fe>] vfs_read+0xb6/0xe2
 [<c0162411>] sys_read+0x3c/0x62
 [<c0301bfb>] syscall_call+0x7/0xb
Code: 0f b6 42 01 50 8b 54 24 20 0f b6 42 0c 50 68 f5 01 a0 f8 ff 74 24 14 e8 0c 00 7e c7 01 c5 83 c4 18 8b 4c 24 14 8b 09 89 4c 24 14 <8b> 01 0f 18 00 90 a1 3c b2 a0 f8 83 c0 0c 39 c1 e9 df fe ff ff 

On some of the other nodes I saw the following:
Got ENDTRANS from a node not the master: master: 5, sender: -1

Comment 16 Adam "mantis" Manthei 2004-12-22 19:49:07 UTC
Created attachment 109043 [details]
logs of test run 

full logs for comment #15

Comment 17 Christine Caulfield 2005-01-10 09:21:50 UTC
*** Bug 144180 has been marked as a duplicate of this bug. ***

Comment 18 Christine Caulfield 2005-01-11 15:55:41 UTC
*** Bug 142984 has been marked as a duplicate of this bug. ***

Comment 19 Christine Caulfield 2005-01-13 13:40:55 UTC
Clear out joining node if we get NOMINATEd master. Let's see how long
this one lasts.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.46; previous revision: 1.45
done


Comment 20 Christine Caulfield 2005-01-18 14:42:01 UTC
This may also help:

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.48; previous revision: 1.47
done


Comment 21 Corey Marthaler 2005-06-13 19:43:24 UTC
Haven't seen this bug in the 5 months since it was fixed.