Description of problem:
I have been seeing this quite a bit lately running revolver. Revolver will shoot its nodes, and when they are brought back up the cman join ends up deadlocking. In this case, tank-03 and tank-05 were shot; they came back up, had ccsd started on them, and then a cman_tool join was attempted. For whatever reason, tank-01 thinks tank-02 is the master and vice versa:

[root@tank-01 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-02
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-01
Node addresses: 10.15.84.91

[root@tank-02 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-01
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-02
Node addresses: 10.15.84.92

[root@tank-03 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: No
Membership state: Join-Wait

[root@tank-04 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-01
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-04
Node addresses: 10.15.84.94

[root@tank-02 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    5   M   tank-01
   2    1    5   X   tank-03
   3    1    5   M   tank-02
   4    1    5   M   tank-04
   5    1    5   X   tank-05

[root@tank-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 4 2 5 1]

DLM Lock Space:  "clvmd"                             3   3 run       -
[3 4 2 5 1]

DLM Lock Space:  "corey0"                            4   4 run       -
[3 4 2 5 1]

DLM Lock Space:  "corey1"                            6   6 run       -
[3 4 2 5 1]

GFS Mount Group: "corey0"                            5   5 run       -
[3 4 2 5 1]

GFS Mount Group: "corey1"                            7   7 run       -
[3 4 2 5 1]

[root@tank-01 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:      2862611
Unlock operations:    2853855
Convert operations:  11399480
Completion ASTs:     17115870
Blocking ASTs:              2

Lockqueue        num  waittime   ave
WAIT_RSB       21573     99436     4
WAIT_GRANT      5606      1597     0
WAIT_UNLOCK       30        50     1
Total          27209    101083     3

[root@tank-02 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:      4235278
Unlock operations:    4226584
Convert operations:  14944074
Completion ASTs:     23405844
Blocking ASTs:            115

Lockqueue        num  waittime   ave
WAIT_RSB     1197641  18662165    15
WAIT_CONV         31       501    16
WAIT_GRANT      6009     23749     3
WAIT_UNLOCK      353      3951    11
Total        1204034  18690366    15

[root@tank-04 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:       574693
Unlock operations:     562742
Convert operations:   2112002
Completion ASTs:      3249411
Blocking ASTs:             18

Lockqueue        num  waittime   ave
WAIT_RSB      534095  17185688    32
WAIT_GRANT      5669      5905     1
WAIT_UNLOCK       78      1555    19
Total         539842  17193148    31

[root@tank-01 ~]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes
clvmd total nodes 5
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd recover event 2 done
clvmd move flags 0,0,1 ids 0,2,2
clvmd process held requests
clvmd processed 0 requests
clvmd recover event 2 finished
corey0 move flags 0,1,0 ids 0,3,0
corey0 move use event 3
corey0 recover event 3 (first)
corey0 add nodes
corey0 total nodes 5
corey0 rebuild resource directory
corey0 rebuilt 5812 resources
corey0 recover event 3 done
corey0 move flags 0,0,1 ids 0,3,3
corey0 process held requests
corey0 processed 0 requests
corey0 recover event 3 finished
corey1 move flags 0,1,0 ids 0,5,0
corey1 move use event 5
corey1 recover event 5 (first)
corey1 add nodes
corey1 total nodes 5
corey1 rebuild resource directory
corey1 rebuilt 5870 resources
corey1 recover event 5 done
corey1 move flags 0,0,1 ids 0,5,5
corey1 process held requests
corey1 processed 0 requests
corey1 recover event 5 finished

[root@tank-02 ~]# cat /proc/cluster/dlm_debug
00000 node -1/-1 " 7
corey0 resent 4 requests
corey0 recover event 87 finished
corey1 move flags 1,0,0 ids 85,85,85
corey1 move flags 0,1,0 ids 85,89,85
corey1 move use event 89
corey1 recover event 89
corey1 add node 1
corey1 total nodes 5
corey1 rebuild resource directory
corey1 rebuilt 5952 resources
corey1 purge requests
corey1 purged 0 requests
corey1 mark waiting requests
corey1 mark 2be008e lq 1 nodeid -1
corey1 mark 2bb029e lq 1 nodeid -1
corey1 mark 2b20362 lq 1 nodeid -1
corey1 mark 2c70149 lq 1 nodeid -1
corey1 marked 4 requests
corey1 recover event 89 done
corey1 move flags 0,0,1 ids 85,89,89
corey1 process held requests
corey1 processed 0 requests
corey1 resend marked requests
corey1 resend 2be008e lq 1 flg 200000 node -1/-1 " 7
corey1 resend 2bb029e lq 1 flg 200000 node -1/-1 " 11
corey1 resend 2b20362 lq 1 flg 200000 node -1/-1 " 7
corey1 resend 2c70149 lq 1 flg 200000 node -1/-1 " 11
corey1 resent 4 requests
corey1 recover event 89 finished

Version-Release number of selected component (if applicable):
[root@tank-01 ~]# rpm -qa | grep cman
cman-1.0-0.pre33.14
cman-kernheaders-2.6.9-34.3
cman-kernel-smp-2.6.9-34.3

How reproducible:
Revolver appears to always eventually hit this.
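The crossed "Master is ..." lines above are the whole deadlock: each of tank-01 and tank-02 is waiting for the other to drive the state transition, so neither finishes and the joining nodes sit in Join-Wait. A minimal C sketch of how two transitions starting concurrently can produce this crossed view (purely illustrative; the struct, handler, and message flow here are hypothetical, not the cman-kernel internals):

/*
 * Illustrative sketch only -- NOT the actual cman-kernel code.
 * Models two nodes that each start a state transition at nearly
 * the same time; each node's START message crosses the other's
 * on the wire, so each adopts the *other* node as transition
 * master and then waits for it forever.
 */
#include <stdio.h>

struct node {
    int id;
    int master_id;   /* who this node believes is driving the transition */
};

/* Hypothetical handler: on seeing a START-TRANSITION message, a node
 * that has not yet committed to a master adopts the sender. */
static void on_start_transition(struct node *n, int sender_id)
{
    if (n->master_id == 0)
        n->master_id = sender_id;
}

int main(void)
{
    /* node IDs taken from /proc/cluster/nodes above */
    struct node tank01 = { .id = 1, .master_id = 0 };
    struct node tank02 = { .id = 3, .master_id = 0 };

    /* Both start a transition concurrently; each node processes the
     * other's START first. */
    on_start_transition(&tank01, tank02.id);
    on_start_transition(&tank02, tank01.id);

    /* Matches the crossed "Master is ..." lines in the status dumps. */
    printf("tank-01 thinks master is node %d\n", tank01.master_id);
    printf("tank-02 thinks master is node %d\n", tank02.master_id);
    return 0;
}

Once both nodes have adopted each other, neither ever sees itself as master, the transition never completes, and a joining node such as tank-03 stays in Join-Wait indefinitely.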
This might happen if two nodes go into a CHECK transition at slightly different (but still overlapping) times. This checkin fixes that problem. I hope it also fixes this problem!

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.44.2.19; previous revision: 1.44.2.18
done
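For illustration only (the actual change is in membership.c revision 1.44.2.19, which is not reproduced here): one generic way to resolve two overlapping transitions is to make the master choice a deterministic function of the node IDs, so both sides compute the same winner and their views can never cross:

/*
 * Hedged sketch, not the code from the checkin above: when two
 * concurrent transitions collide, break the tie with a pure
 * function of node ID (here, lowest ID wins).  Every node
 * evaluates the same inputs to the same answer.
 */
#include <stdio.h>

static int pick_master(int a, int b)
{
    return a < b ? a : b;
}

int main(void)
{
    /* tank-01 is node 1 and tank-02 is node 3 in /proc/cluster/nodes */
    printf("agreed master: node %d\n", pick_master(1, 3));
    printf("agreed master: node %d\n", pick_master(3, 1)); /* same result */
    return 0;
}

Because the result does not depend on message arrival order, the overlapping-CHECK race in the original report cannot leave the two nodes pointing at each other.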
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-734.html