Bug 187777
| Summary: | dlm_emergency_shutdown caused by another node being rebooted | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
| Component: | dlm | Assignee: | Christine Caulfield <ccaulfie> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | ccaulfie, cluster-maint, jbacik |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2006-0518 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-05-25 21:04:47 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 180185 | | |
| Attachments: | | | |
Description
Corey Marthaler
2006-04-03 15:54:32 UTC
Here's link-08's view of the world:

[root@link-08 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   X   link-01
   2    1    3   X   link-02
   3    1    3   M   link-08

[root@link-08 ~]# cat /proc/cluster/services
Service          Name        GID LID State   Code
Fence Domain:    "default"     1   2 recover 0 - [3]
DLM Lock Space:  "clvmd"       3   3 recover 0 - [3]
DLM Lock Space:  "gfs0"        4   4 recover 0 - [3]
DLM Lock Space:  "gfs1"        6   6 recover 0 - [3]
GFS Mount Group: "gfs0"        5   5 recover 0 - [3]
GFS Mount Group: "gfs1"        7   7 recover 0 - [3]
User:            "usrm::manager" 8 8 recover 0 - [3]

[root@link-08 cluster]# cat dlm_debug
d -1
gfs1 mark 650054 lq 1 nodeid -1
gfs1 marked 2 requests
gfs1 purge locks of departed nodes
gfs1 purged 1 locks
gfs1 update remastered resources
clvmd move flags 0,0,1 ids 2,8,8
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 8 finished
gfs0 rebuilt 1099 locks
gfs0 recover event 8 done
gfs1 updated 1913 resources
gfs1 rebuild locks
gfs0 move flags 0,0,1 ids 3,8,8
gfs0 process held requests
gfs0 processed 0 requests
gfs0 resend marked requests
gfs0 resent 0 requests
gfs0 recover event 8 finished
gfs1 rebuilt 1919 locks
gfs1 recover event 8 done
gfs1 move flags 0,0,1 ids 5,8,8
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resend 5901bb lq 1 flg 200000 node -1/-1 " 7
gfs1 resend 650054 lq 1 flg 200000 node -1/-1 " 5
gfs1 resent 2 requests
gfs1 recover event 8 finished
gfs1 move flags 1,0,0 ids 8,8,8
clvmd move flags 1,0,0 ids 8,8,8
gfs0 move flags 1,0,0 ids 8,8,8

[root@link-08 cluster]# cat dlm_stats
DLM stats (HZ=1000)

Lock operations:     197760
Unlock operations:   176703
Convert operations:  627646
Completion ASTs:    1002106
Blocking ASTs:            5

Lockqueue        num  waittime   ave
WAIT_RSB      104635    418221     3
WAIT_CONV          7         8     1
WAIT_GRANT     26678     17879     0
WAIT_UNLOCK    11447     91841     8
Total         142767    527949     3

[root@link-08 cluster]# cat sm_debug
cover state 3
03000008 recover state 5
00000001 remove node 2 count 1
01000006 remove node 2 count 1
01000004 remove node 2 count 1
01000003 remove node 2 count 1
02000007 remove node 2 count 1
02000005 remove node 2 count 1
03000008 remove node 2 count 1

[root@link-08 cluster]# cat status
Protocol version: 5.0.1
Config version: 2
Cluster name: LINK_128
Cluster ID: 19208
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 3
Total_votes: 1
Quorum: 2 Activity blocked
Active subsystems: 10
Node name: link-08
Node addresses: 10.15.89.158

cman has shut down on link-02; /var/log/messages might give some more info on why. The dlm response is correct.

Just hit this again on the link cluster and the following nodes all ended up withdrawing and panicking: link-02, link-04, link-07, link-08. This was after link-01 and link-07 were shot and brought back into the cluster.

Here are the requested /var/log/messages of all nodes.

link-01:
May 2 10:38:07 link-01 kernel: CMAN: Waiting to join or form a Linux-cluster
May 2 10:38:08 link-01 kernel: CMAN: sending membership request
May 2 10:38:08 link-01 ccsd[3832]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5
May 2 10:38:08 link-01 ccsd[3832]: Initial status:: Inquorate
May 2 10:38:09 link-01 kernel: CMAN: got node link-03
May 2 10:38:09 link-01 kernel: CMAN: got node link-04
May 2 10:38:09 link-01 kernel: CMAN: got node link-08
May 2 10:38:09 link-01 kernel: CMAN: got node link-05
May 2 10:38:09 link-01 kernel: CMAN: got node link-02
May 2 10:38:24 link-01 kernel: CMAN: node link-07 rejoining
May 2 10:38:25 link-01 kernel: CMAN: Being told to leave the cluster by node 6
May 2 10:38:25 link-01 kernel: CMAN: we are leaving the cluster.
May 2 10:38:25 link-01 kernel: WARNING: dlm_emergency_shutdown
May 2 10:38:25 link-01 kernel: WARNING: dlm_emergency_shutdown

link-02:
May 2 10:33:58 link-02 kernel: CMAN: node link-07 has been removed from the cluster : Missed too many heartbeats
May 2 10:34:09 link-02 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 10:34:16 link-02 fenced[6033]: fencing deferred to link-03
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link1.5: jid=4: Trying to acquire journal lock...
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=4: Trying to acquire journal lock...
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link1.5: jid=4: Busy
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link1.5: jid=2: Trying to acquire journal lock...
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=4: Busy
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Trying to acquire journal lock...
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Looking at journal...
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link1.5: jid=2: Busy
May 2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Acquiring the transaction lock...
May 2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Replaying journal...
May 2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Replayed 148 of 148 blocks
May 2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: replays = 148, skips = 0, sames = 0
May 2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Journal replayed in 1s
May 2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Done
May 2 10:39:06 link-02 kernel: CMAN: node link-01 rejoining
May 2 10:39:08 link-02 kernel: CMAN: node link-07 rejoining
May 2 10:39:23 link-02 kernel: CMAN: removing node link-01 from the cluster : Shutdown
May 2 10:39:41 link-02 kernel: CMAN: node link-08 has been removed from the cluster : Inconsistent cluster view
May 2 10:39:59 link-02 kernel: CMAN: node link-03 has been removed from the cluster : Inconsistent cluster view
May 2 10:40:25 link-02 kernel: CMAN: node link-04 has been removed from the cluster : Inconsistent cluster view
May 2 10:40:46 link-02 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
May 2 10:40:46 link-02 kernel: WARNING: dlm_emergency_shutdown
May 2 10:40:46 link-02 kernel: WARNING: dlm_emergency_shutdown
May 2 10:40:46 link-02 kernel: SM: 00000001 sm_stop: SG still joined
May 2 10:40:46 link-02 ccsd[3966]: Cluster manager shutdown. Attemping to reconnect...
May 2 10:40:46 link-02 kernel: SM: 01000002 sm_stop: SG still joined
May 2 10:40:46 link-02 kernel: SM: 02000004 sm_stop: SG still joined
May 2 10:41:15 link-02 ccsd[3966]: Unable to connect to cluster infrastructure after 30 seconds.
May 2 10:41:37 link-02 kernel: dlm: dlm_unlock: lkid 110200 lockspace not found

link-03:
May 2 11:56:40 link-03 kernel: CMAN: node link-07 has been removed from the cluster : Missed too many heartbeats
May 2 11:56:51 link-03 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 11:57:28 link-03 fenced[3510]: link-01 not a cluster member after 30 sec post_fail_delay
May 2 11:57:28 link-03 fenced[3510]: link-07 not a cluster member after 30 sec post_fail_delay
May 2 11:57:28 link-03 fenced[3510]: fencing node "link-01"
May 2 11:57:33 link-03 fenced[3510]: fence "link-01" success
May 2 11:57:38 link-03 fenced[3510]: fencing node "link-07"
May 2 11:57:41 link-03 fenced[3510]: fence "link-07" success
May 2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link1.0: jid=4: Trying to acquire journal lock...
May 2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link0.1: jid=4: Trying to acquire journal lock...
May 2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link1.0: jid=4: Busy
May 2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link1.0: jid=2: Trying to acquire journal lock...
May 2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link0.1: jid=4: Busy
May 2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link0.1: jid=2: Trying to acquire journal lock...
May 2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link1.0: jid=2: Busy
May 2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link0.1: jid=2: Busy
May 2 12:01:49 link-03 kernel: CMAN: node link-01 rejoining
May 2 12:01:50 link-03 kernel: CMAN: node link-07 rejoining
May 2 12:02:21 link-03 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 12:02:24 link-03 kernel: CMAN: removing node link-08 from the cluster : Inconsistent cluster view
May 2 12:02:42 link-03 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
May 2 12:02:42 link-03 kernel: WARNING: dlm_emergency_shutdown
May 2 12:02:42 link-03 kernel: WARNING: dlm_emergency_shutdown
May 2 12:02:42 link-03 kernel: SM: 00000001 sm_stop: SG still joined
May 2 12:02:42 link-03 kernel: SM: 01000002 sm_stop: SG still joined
May 2 12:02:42 link-03 kernel: SM: 02000004 sm_stop: SG still joined
May 2 12:02:42 link-03 ccsd[3400]: Cluster manager shutdown. Attemping to reconnect...
May 2 12:03:00 link-03 ccsd[3400]: Unable to connect to cluster infrastructure after 30 seconds.
May 2 12:03:30 link-03 ccsd[3400]: Unable to connect to cluster infrastructure after 60 seconds.

link-04:
May 2 05:56:01 link-04 kernel: CMAN: node link-07 has been removed from the cluster : Missed too many heartbeats
May 2 05:56:12 link-04 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 05:56:19 link-04 fenced[5446]: fencing deferred to link-03
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Trying to acquire journal lock...
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Trying to acquire journal lock...
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Looking at journal...
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Looking at journal...
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Acquiring the transaction lock...
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Acquiring the transaction lock...
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Replaying journal...
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Replayed 0 of 169 blocks
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: replays = 0, skips = 2, sames = 167
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Journal replayed in 1s
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Done
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=2: Trying to acquire journal lock...
May 2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=2: Busy
May 2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Replaying journal...
May 2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Replayed 0 of 3 blocks
May 2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: replays = 0, skips = 0, sames = 3
May 2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Journal replayed in 1s
May 2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Done
May 2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=2: Trying to acquire journal lock...
May 2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=2: Looking at journal...
May 2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=2: Done
May 2 06:01:09 link-04 kernel: CMAN: node link-01 rejoining
May 2 06:01:11 link-04 kernel: CMAN: node link-07 rejoining
May 2 06:01:41 link-04 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 06:01:44 link-04 kernel: CMAN: node link-08 has been removed from the cluster : Inconsistent cluster view
May 2 06:02:02 link-04 kernel: CMAN: node link-03 has been removed from the cluster : Inconsistent cluster view
May 2 06:02:28 link-04 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
May 2 06:02:28 link-04 kernel: WARNING: dlm_emergency_shutdown

link-05:
May 2 04:47:02 link-05 kernel: CMAN: node link-07 has been removed from the cluster : Missed too many heartbeats
May 2 04:47:13 link-05 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 04:47:20 link-05 fenced[5968]: fencing deferred to link-03
May 2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link1.6: jid=4: Trying to acquire journal lock...
May 2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link0.6: jid=4: Trying to acquire journal lock...
May 2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link1.6: jid=4: Busy
May 2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link1.6: jid=2: Trying to acquire journal lock...
May 2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link0.6: jid=4: Busy
May 2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link0.6: jid=2: Trying to acquire journal lock...
May 2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link1.6: jid=2: Busy
May 2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link0.6: jid=2: Busy
May 2 04:52:10 link-05 kernel: CMAN: node link-01 rejoining
May 2 04:52:12 link-05 kernel: CMAN: node link-07 rejoining
May 2 04:52:42 link-05 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 04:52:45 link-05 kernel: CMAN: node link-08 has been removed from the cluster : Inconsistent cluster view
May 2 04:53:03 link-05 kernel: CMAN: removing node link-03 from the cluster : Inconsistent cluster view
May 2 04:53:29 link-05 kernel: CMAN: removing node link-04 from the cluster : Inconsistent cluster view
May 2 04:53:50 link-05 kernel: CMAN: removing node link-02 from the cluster : Inconsistent cluster view
May 2 04:58:53 link-05 kernel: CMAN: too many transition restarts - will die
May 2 04:58:53 link-05 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
May 2 04:58:53 link-05 kernel: WARNING: dlm_emergency_shutdown
May 2 04:58:54 link-05 kernel: WARNING: dlm_emergency_shutdown
May 2 04:58:54 link-05 kernel: SM: 00000001 sm_stop: SG still joined
May 2 04:58:54 link-05 kernel: SM: 01000002 sm_stop: SG still joined
May 2 04:58:54 link-05 kernel: SM: 02000004 sm_stop: SG still joined
May 2 04:58:54 link-05 ccsd[3894]: Cluster manager shutdown. Attemping to reconnect...
May 2 04:59:20 link-05 ccsd[3894]: Unable to connect to cluster infrastructure after 30 seconds.
May 2 04:59:50 link-05 ccsd[3894]: Unable to connect to cluster infrastructure after 60 seconds.

link-07:
May 2 06:55:23 link-07 kernel: CMAN: Waiting to join or form a Linux-cluster
May 2 06:55:24 link-07 ccsd[4092]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5
May 2 06:55:24 link-07 ccsd[4092]: Initial status:: Inquorate
May 2 06:55:24 link-07 kernel: CMAN: sending membership request
May 2 06:55:25 link-07 kernel: CMAN: sending membership request
May 2 06:55:41 link-07 kernel: CMAN: got node link-03
May 2 06:55:41 link-07 kernel: CMAN: got node link-04
May 2 06:55:41 link-07 kernel: CMAN: got node link-02
May 2 06:55:41 link-07 kernel: CMAN: got node link-05
May 2 06:55:41 link-07 kernel: CMAN: got node link-01
May 2 06:55:41 link-07 kernel: CMAN: got node link-08
May 2 06:55:56 link-07 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 06:55:59 link-07 kernel: CMAN: node link-08 has been removed from the cluster : Inconsistent cluster view
May 2 06:56:00 link-07 kernel: CMAN: got WAIT barrier not in phase 1 TRANSITION.1330 (2)
May 2 06:56:06 link-07 kernel: CMAN: quorum regained, resuming activity
May 2 06:56:06 link-07 ccsd[4092]: Cluster is quorate. Allowing connections.
May 2 06:56:17 link-07 kernel: CMAN: node link-03 has been removed from the cluster : Inconsistent cluster view
May 2 06:56:43 link-07 kernel: CMAN: node link-04 has been removed from the cluster : Inconsistent cluster view
May 2 06:57:04 link-07 kernel: CMAN: node link-02 has been removed from the cluster : Inconsistent cluster view
May 2 07:02:21 link-07 kernel: CMAN: removing node link-05 from the cluster : Missed too many heartbeats

link-08:
May 2 10:26:04 link-08 kernel: CMAN: removing node link-07 from the cluster : Missed too many heartbeats
May 2 10:26:10 link-08 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 10:26:22 link-08 fenced[6112]: fencing deferred to link-03
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=4: Trying to acquire journal lock...
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link0.0: jid=4: Trying to acquire journal lock...
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=4: Busy
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Trying to acquire journal lock...
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Looking at journal...
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link0.0: jid=4: Busy
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link0.0: jid=2: Trying to acquire journal lock...
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link0.0: jid=2: Busy
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Acquiring the transaction lock...
May 2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Replaying journal...
May 2 10:27:13 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Replayed 765 of 1123 blocks
May 2 10:27:13 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: replays = 765, skips = 142, sames = 216
May 2 10:27:13 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Journal replayed in 1s
May 2 10:27:13 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Done
May 2 10:31:13 link-08 kernel: CMAN: node link-01 rejoining
May 2 10:31:14 link-08 kernel: CMAN: node link-07 rejoining
May 2 10:31:36 link-08 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May 2 10:31:48 link-08 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
May 2 10:31:48 link-08 kernel: WARNING: dlm_emergency_shutdown
May 2 10:31:48 link-08 kernel: dlm: link0: remote_stage error -105 6e02a9
May 2 10:31:48 link-08 kernel: dlm: link0: remote_stage error -105 880212
May 2 10:31:48 link-08 kernel: dlm: link1: remote_stage error -105 480354
May 2 10:31:48 link-08 kernel: WARNING: dlm_emergency_shutdown
May 2 10:31:48 link-08 kernel: SM: 00000001 sm_stop: SG still joined
May 2 10:31:48 link-08 kernel: SM: 01000002 sm_stop: SG still joined
May 2 10:31:48 link-08 kernel: SM: 02000004 sm_stop: SG still joined
May 2 10:31:48 link-08 ccsd[4045]: Cluster manager shutdown. Attemping to reconnect...

If that "Inconsistent cluster view" is the cause of the DLM shutdowns then this is a CMAN bug, and a damnably difficult one to isolate IIRC.

Something similar was seen on the mailing list a couple of weeks ago. I was sent a tcpdump and around 50-70% of the network packets were going missing in that case. I doubt that's happening here (and hope it isn't)!

I've set a revolver running on my 8-node bench cluster and it's been running all day now.

This is an extension of bz#177163

Here's how to reproduce it:
1. Start up a few nodes (minimum 3)
2. Let them run until the sequence numbers get > 32767 (OK, this is hard to verify, but with some code finagling it's easy to fake)
3. Remove one node
4. Restart that node (with sequence numbers starting from 0)

Boom. The whole cluster falls apart.

Created attachment 128650 [details]
Proposed patch to fix
This is the patch I'm currently testing;
if it runs OK over the weekend I'll commit it.
Anyone else please feel free to try it :-)
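
For anyone wondering why 32767 would be a threshold in the reproduction steps: that is the largest value a signed 16-bit integer can hold. The sketch below is only an illustration of that boundary and is not taken from cnxman.c or membership.c; it assumes (hypothetically) that the same 16-bit sequence counter is read as signed in one place and unsigned in another. The helper names `as_int16`, `as_uint16`, `views_agree` and `first_disagreement` are made up for this example.

```python
def as_uint16(value):
    """Keep only the low 16 bits, read as an unsigned number (0..65535)."""
    return value & 0xFFFF


def as_int16(value):
    """Keep only the low 16 bits, read as a signed number (-32768..32767)."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


def views_agree(seq):
    """True when the signed and unsigned readings of seq are the same number."""
    return as_int16(seq) == as_uint16(seq)


def first_disagreement(start, stop):
    """Return the first sequence number in [start, stop) where the readings differ."""
    for seq in range(start, stop):
        if not views_agree(seq):
            return seq
    return None


if __name__ == "__main__":
    # A freshly restarted node starting from 0 sits in the range where both
    # readings match; a long-running peer past 32767 does not.
    for seq in (0, 32767, 32768, 40000):
        print(seq, as_int16(seq), as_uint16(seq), views_agree(seq))
    print("first mismatch:", first_disagreement(0, 65536))
```

Under that assumption, everything looks fine for as long as the counters stay at or below 32767, which would fit the "let them run until the sequence numbers get > 32767" step. Whether the actual fix addresses exactly this kind of mismatch is not stated in this report; the attached patch is the authority.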
This ran OK over the weekend and in other tests.

-rRHEL4
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.25; previous revision: 1.42.2.24
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v <-- membership.c
new revision: 1.44.2.23; previous revision: 1.44.2.22
done

-rSTABLE
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.12.4.1.2.12; previous revision: 1.42.2.12.4.1.2.11
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v <-- membership.c
new revision: 1.44.2.18.6.5; previous revision: 1.44.2.18.6.4
done

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0518.html