Bug 133512
Summary: | node gets different view of cluster and leaves | | |
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
Component: | gfs | Assignee: | Christine Caulfield <ccaulfie> |
Status: | CLOSED NEXTRELEASE | QA Contact: | GFS Bugs <gfs-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | medium | | |
Version: | 4 | CC: | amanthei, ccaulfie, djansa |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | i686 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2005-06-13 19:43:24 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 142853, 144795 | | |
Attachments: | | | |

Description
Corey Marthaler
2004-09-24 16:01:09 UTC
If you can reproduce it, can you also include the contents of /proc/cluster/status please?

I reproduced a similar scenario to this bug. I did a cman_tool join on all nodes, and morph-04 never joined, but instead of the "we are leaving the cluster" message, it gave "kernel: CMAN: Been in JOINWAIT for too long - giving up" even though every other node had joined just fine.

Here is what /proc/cluster/status showed on morph-04:

    [root@morph-04 root]# cat /proc/cluster/status
    Version: 2.0.1
    Config version: 1
    Cluster name: morph-cluster
    Cluster ID: 41652
    Membership state: Not-in-Cluster

And here is one of the live nodes:

    [root@morph-03 root]# cat /proc/cluster/status
    Version: 2.0.1
    Config version: 1
    Cluster name: morph-cluster
    Cluster ID: 41652
    Membership state: Cluster-Member
    Nodes: 4
    Expected_votes: 5
    Total_votes: 4
    Quorum: 3
    Active subsystems: 0
    Node addresses: 192.168.44.63

I then attempted another cman_tool join and morph-04 finally joined and all was happy.

I doubt this is related. I have spotted (and fixed) a case where a node could give up too early in the join if its request messages get lost on the network. But this can still happen if the cluster is in transition for a long time while the node is wanting to join and some messages go missing. It might be worth increasing the timeout in some circumstances but I'm not totally convinced.

I reproduced this again:

    Nov 9 12:48:39 morph-05 kernel: CMAN: Waiting to join or form a Linux-cluster
    Nov 9 12:48:58 morph-05 kernel: CMAN: sending membership request
    Nov 9 12:48:58 morph-05 kernel: CMAN: got node morph-01
    Nov 9 12:49:03 morph-05 kernel: CMAN: got node morph-02
    Nov 9 12:49:03 morph-05 kernel: CMAN: got node morph-03
    CMAN: quorum regained, resuming activity
    Nov 9 12:49:06 morph-05 kernel: CMAN: quorum regained, resuming activity
    Nov 9 12:49:08 morph-05 kernel: CMAN: got node morph-04
    Nov 9 12:49:09 morph-05 kernel: CMAN: we are leaving the cluster. Reason is 5
    Nov 9 12:49:09 morph-05 kernel:

    [root@morph-05 root]# cat /proc/cluster/nodes
    Node Votes Exp Sts Name

    [root@morph-05 root]# cat /proc/cluster/status
    Version: 3.0.1
    Config version: 1
    Cluster name: morph-cluster
    Cluster ID: 41652
    Membership state: Not-in-Cluster

Status from a node in the cluster:

    [root@morph-04 root]# cat /proc/cluster/nodes
    Node Votes Exp Sts Name
    1 1 5 M morph-01
    2 1 5 X morph-05
    3 1 5 M morph-03
    4 1 5 M morph-04
    5 1 5 M morph-02

    [root@morph-04 root]# cat /proc/cluster/status
    Version: 3.0.1
    Config version: 1
    Cluster name: morph-cluster
    Cluster ID: 41652
    Membership state: Cluster-Member
    Nodes: 4
    Expected_votes: 5
    Total_votes: 4
    Quorum: 3
    Active subsystems: 0
    Node addresses: 192.168.44.64

Got it! The node had a pending join, so when it came to compare node states the master thought that node was dead and the local one thought it was joining. As far as this comparison goes, "joining" is the same as "dead".

    Checking in membership.c;
    /cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
    new revision: 1.30; previous revision: 1.29
    done

Both morph-04 and morph-05 appeared to hit this today after morph-01 and morph-02 rejoined after having been shot:

    Nov 17 13:47:31 morph-04 kernel: CMAN: we are leaving the cluster. Reason is 5
    Nov 17 13:47:31 morph-04 kernel:
    Nov 17 13:47:31 morph-04 kernel: dlm: closing connection to node 2
    Nov 17 13:47:31 morph-04 kernel: dlm: closing connection to node 3
    Nov 17 13:47:31 morph-04 kernel: dlm: closing connection to node 4
    Nov 17 13:47:31 morph-04 kernel: SM: 00000001 sm_stop: SG still joined
    Nov 17 13:47:31 morph-04 kernel: SM: 01000003 sm_stop: SG still joined
    Nov 17 13:47:31 morph-04 kernel: SM: 02000005 sm_stop: SG still joined
    Nov 17 13:47:35 morph-04 kernel: dlm: dlm_unlock: lkid 30389 lockspace not found

    Nov 17 13:48:03 morph-05 kernel: CMAN: we are leaving the cluster. Reason is 5
    Nov 17 13:48:03 morph-05 kernel:
    Nov 17 13:48:03 morph-05 kernel: dlm: closing connection to node 2
    Nov 17 13:48:03 morph-05 kernel: dlm: closing connection to node 3
    Nov 17 13:48:03 morph-05 kernel: dlm: closing connection to node 4
    Nov 17 13:48:03 morph-05 kernel: dlm: process_cluster_request invalid lockspace 1000004 from 4 req 9
    Nov 17 13:48:03 morph-05 kernel: SM: 00000001 sm_stop: SG still joined
    Nov 17 13:48:03 morph-05 kernel: 32 " 11 88b0a0f"

I've added an assert to the code. If you see a "BUG() at line 244 of membership.c", can you post it here along with the last few messages of the rest of the nodes please.

If a node died in joinconf then it only got marked dead on the local node. That could possibly cause this bug. Feel free to punt this back if you see it again :)

    Checking in membership.c;
    /cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
    new revision: 1.35; previous revision: 1.34
    done

I keep tripping the assert (see bug #142853). On my 8-node setup, I see the BUG() every once in a while. However, I see this bug almost every time when I run "cman_tool join" on all the nodes at the same time.

Changing status to ASSIGNED as this is still not fixed.

Created attachment 108576 [details]
logs from test run
demonstration of BUG assertion:
"kernel BUG at line membership.c:244!"
node trin-07 is the node that bombed out
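
The mismatch diagnosed earlier in this thread (the master saw the node as dead while the node saw itself as joining) comes down to how two membership views are compared. What follows is a minimal sketch of that idea only, not the code in membership.c. The `name:state` input format, the `views_match` helper, and the use of `J` for a joining node are all assumptions made for illustration. The sketch normalizes "joining" to "dead" before comparing, as the fix description suggests.

```shell
#!/bin/sh
# Illustrative only: compare two views of cluster membership.
# Each view is a space-separated list of "name:state" entries.
# Assumed state letters: M = member, X = dead, J = joining.

# views_match OUR_VIEW MASTER_VIEW
# Exits 0 when both views agree once "joining" is treated as "dead",
# and 1 when they differ.
views_match() {
    awk -v a="$1" -v b="$2" '
        # A joining node counts as dead for the purpose of comparison.
        function norm(s) { return (s == "J") ? "X" : s }
        BEGIN {
            na = split(a, ea, " ")
            nb = split(b, eb, " ")
            if (na != nb) exit 1
            for (i = 1; i <= na; i++) {
                split(ea[i], p, ":")
                ours[p[1]] = norm(p[2])
            }
            for (i = 1; i <= nb; i++) {
                split(eb[i], p, ":")
                if (!(p[1] in ours) || ours[p[1]] != norm(p[2])) exit 1
            }
            exit 0
        }'
}
```

With this normalization, a local view of `morph-05:J` and a master view of `morph-05:X` compare as equal, whereas a raw string comparison of the two states would report a mismatch and, in the scenario described in this report, lead the node to leave the cluster.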
*** Bug 142853 has been marked as a duplicate of this bug. ***

OK, it passes my tests now, let's see how it fares on yours!

Things seem to be a little better. I'm not seeing the BUG() anymore, but it seems that the nodes still aren't able to join the cluster 100% of the time. This might be another bug though... need to investigate that further. In the meantime, this is the test I've been using:

    #!/bin/bash
    # Repeatedly start and stop cman on all nodes until a step fails.
    i=0
    while echo "$(date): iteration $i"
    do
        action="starting"
        echo "$action cman"
        broadcast root@trin-0{1,2,3,4,6,7,8,9} -- "/etc/init.d/cman start" || break
        # Each node must appear in its own view of the cluster.
        broadcast root@trin-0{1,2,3,4,6,7,8,9} -- "grep \$(hostname -s) /proc/cluster/nodes" || break
        action="stopping"
        echo "$action cman"
        broadcast root@trin-0{1,2,3,4,6,7,8,9} -- "/etc/init.d/cman stop" || break
        # /proc/cluster is a directory, so test with -e rather than -f.
        broadcast root@trin-0{1,2,3,4,6,7,8,9} -- "! [ -e /proc/cluster ]" || break
        i=$((i+1))
    done
    echo "error detected $action after $i iterations"

Moving back to ASSIGNED since cman is still exhibiting the same behavior as originally described.

The script from comment #13 caused one of my 8 nodes (trin-03) to fail to start cman properly. While trying to figure out what was going on, I cat'ed /proc/cluster/status and caused my kernel to oops:

    [root@trin-03 ~]# cat /proc/cluster/nodes
    Node Votes Exp Sts Name
    1 1 9 X trin-01
    3 1 9 X trin-04
    4 1 9 X trin-06
    5 1 9 X trin-08
    6 1 9 X trin-02
    7 1 9 X trin-07
    8 1 9 X trin-09
    [root@trin-03 ~]# cat /proc/cluster/status <-- caused oops
    Segmentation fault
    [root@trin-03 ~]# dmesg
    CMAN <CVS> (built Dec 22 2004 10:54:49) installed
    NET: Registered protocol family 30
    CMAN: Waiting to join or form a Linux-cluster
    CMAN: sending membership request
    CMAN: sending membership request
    CMAN: got node trin-01
    CMAN: got node trin-04
    CMAN: got node trin-06
    CMAN: got node trin-08
    CMAN: quorum regained, resuming activity
    CMAN: got node trin-07
    CMAN: got node trin-02
    CMAN: nmembers in HELLO message from 3 does not match our view (got 5, exp 6)
    CMAN: node trin-01 is not responding - removing from the cluster
    CMAN: got node trin-07
    CMAN: got node trin-09
    CMAN: nmembers in HELLO message from 6 does not match our view (got 6, exp 7)
    CMAN: node trin-01 rejoining
    CMAN: node trin-08 is not responding - removing from the cluster
    CMAN: node trin-01 is not responding - removing from the cluster
    CMAN: node trin-01 rejoining
    CMAN: we are leaving the cluster. Reason is 5
    CMAN: Waiting to join or form a Linux-cluster
    CMAN: sending membership request
    CMAN: sending membership request
    CMAN: sending membership request
    CMAN: sending membership request
    CMAN: got node trin-09
    CMAN: got node trin-07
    CMAN: got node trin-02
    CMAN: got node trin-08
    CMAN: got node trin-06
    CMAN: got node trin-01
    CMAN: got node trin-04
    Got ENDTRANS from a node not the master: master: 950150984, sender: 3
    CMAN: node trin-04 is not responding - removing from the cluster
    CMAN: node trin-06 is not responding - removing from the cluster
    CMAN: node trin-08 is not responding - removing from the cluster
    CMAN: node trin-02 is not responding - removing from the cluster
    CMAN: node trin-07 is not responding - removing from the cluster
    CMAN: node trin-09 is not responding - removing from the cluster
    Unable to handle kernel paging request at virtual address 6c636e69
    printing eip:
    f89f8444
    *pde = 00000000
    Oops: 0000 [#1]
    Modules linked in: cman(U) parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc md5 ipv6 dm_mod button battery ac uhci_hcd hw_random e1000 floppy ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
    CPU: 0
    EIP: 0060:[<f89f8444>] Tainted: GF VLI
    EFLAGS: 00010282 (2.6.9-1.906_EL)
    EIP is at proc_cluster_status+0x29c/0x2d0 [cman]
    eax: 0000000f ebx: f8a001ca ecx: 6c636e69 edx: f8a00201
    esi: 00000000 edi: ebcae071 ebp: 000000ec esp: f5b0ef2c
    ds: 007b es: 007b ss: 0068
    Process cat (pid: 15827, threadinfo=f5b0e000 task=ebcc00b0)
    Stack: ebcae0dd 00000001 00000000 ebcc00b0 00000000 6c636e69 ebcae000 f6721c00
           f5e31c80 ebcae000 00000400 c019f585 00000400 f6721c00 00000000 00000400
           08b49858 00000000 00000000 c0352920 f5e31c80 00000400 f5b0efac c01621fe
    Call Trace:
    [<c019f585>] proc_file_read+0x97/0x225
    [<c01621fe>] vfs_read+0xb6/0xe2
    [<c0162411>] sys_read+0x3c/0x62
    [<c0301bfb>] syscall_call+0x7/0xb
    Code: 0f b6 42 01 50 8b 54 24 20 0f b6 42 0c 50 68 f5 01 a0 f8 ff 74 24 14 e8 0c 00 7e c7 01 c5 83 c4 18 8b 4c 24 14 8b 09 89 4c 24 14 <8b> 01 0f 18 00 90 a1 3c b2 a0 f8 83 c0 0c 39 c1 e9 df fe ff ff

On some of the other nodes I saw the following:

    Got ENDTRANS from a node not the master: master: 5, sender: -1

Created attachment 109043 [details]
logs of test run
full logs for comment #15

*** Bug 144180 has been marked as a duplicate of this bug. ***

*** Bug 142984 has been marked as a duplicate of this bug. ***

Clear out joining node if we get NOMINATEd master. Let's see how long this one lasts.

    Checking in membership.c;
    /cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
    new revision: 1.46; previous revision: 1.45
    done

This may also help:

    Checking in membership.c;
    /cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
    new revision: 1.48; previous revision: 1.47
    done

Haven't seen this bug in the 5 months since it was fixed.
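
The join check in the test loop quoted in this report only greps for the hostname in /proc/cluster/nodes, so it would also pass for a node that is listed with status `X`. A slightly stricter check might require the `Sts` column to be `M`. The sketch below is one way to do that; the `is_member` helper name is hypothetical, and the column layout is assumed from the `/proc/cluster/nodes` output quoted above rather than taken from cman documentation.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named node is listed with
# status "M" in a /proc/cluster/nodes-style listing read from stdin.
# Assumed column layout (from the output quoted in this report):
#   Node Votes Exp Sts Name
is_member() {
    awk -v name="$1" '
        # Skip the header line; remember whether the matching row is a member.
        NR > 1 && $5 == name { found = ($4 == "M") }
        END { exit(found ? 0 : 1) }'
}

# Possible use on a live node (not run here):
#   is_member "$(hostname -s)" < /proc/cluster/nodes
```

Fed the morph-04 listing from this report, `is_member morph-01` would succeed and `is_member morph-05` would fail, because morph-05 is listed with status `X`. A node that is absent from the listing also fails the check.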