Bug 126758 - clvmd and dlm can deadlock when starting up
Summary: clvmd and dlm can deadlock when starting up
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-06-25 22:08 UTC by Corey Marthaler
Modified: 2010-01-12 02:53 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-08-25 15:27:49 UTC
Embargoed:



Description Corey Marthaler 2004-06-25 22:08:55 UTC
From Bugzilla Helper: 
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux) 
 
Description of problem: 
1. load the needed mods on all cluster nodes 
2. start ccsd on all cluster nodes 
3. cman_tool join on all cluster nodes (wait until all have actually 
joined) 
4. fence_tool join on all cluster nodes 
5. clvmd on all cluster nodes 
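The ordering above (each step run on every node before the next step starts) can be sketched roughly as follows. This is only an illustration: `run_on_node`, `all_joined`, and `load_modules_cmd` are hypothetical placeholders, since the report names neither the modules to load nor how commands are dispatched to the nodes.

```python
import time
from typing import Callable, List, Sequence, Tuple

# Steps 2-5 as listed in the report. Step 1 (module loading) is passed in
# by the caller because the report does not name the modules.
STEPS_AFTER_MODULES = ["ccsd", "cman_tool join", "fence_tool join", "clvmd"]


def start_cluster(
    nodes: Sequence[str],
    load_modules_cmd: str,
    run_on_node: Callable[[str, str], None],   # hypothetical dispatcher (e.g. ssh)
    all_joined: Callable[[Sequence[str]], bool],  # hypothetical membership check
    max_polls: int = 60,
    poll_interval: float = 1.0,
) -> List[Tuple[str, str]]:
    """Run each step on every node before moving on to the next step."""
    log: List[Tuple[str, str]] = []
    for cmd in [load_modules_cmd] + STEPS_AFTER_MODULES:
        for node in nodes:
            run_on_node(node, cmd)
            log.append((node, cmd))
        if cmd == "cman_tool join":
            # Step 3: wait until all nodes have actually joined.
            for _ in range(max_polls):
                if all_joined(nodes):
                    break
                time.sleep(poll_interval)
            else:
                raise TimeoutError("not all nodes joined the cluster")
    return log
```

The returned log records the (node, command) pairs in the order they were issued, which may help when comparing a hung startup against a good one.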
 
Every so often when setting up a cluster, clvmd will not get to the 
"run" state. The nodes are stuck in the following states: 
morph-01: update 
morph-02: update 
morph-03: join 
morph-04: join 
morph-05: update 
morph-06: update 
 
 
 
 
 
Version-Release number of selected component (if applicable): 
 
 
How reproducible: 
Sometimes

Comment 1 Corey Marthaler 2004-06-25 22:13:20 UTC
kernel logs of the nodes are in: 
 
/home/msp/cmarthal/pub/bugs/126758/kernellogs 

Comment 2 David Teigland 2004-06-30 02:59:40 UTC
next time you get this could you collect the output on each node of
/proc/cluster/sm_debug
/proc/cluster/dlm_debug

and /proc/cluster/nodes from just one node to provide name/id pairs

after getting that, if you have kdb installed, a backtrace of the
dlm_recoverd and dlm_recvd threads
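A minimal sketch of gathering those two files on a node, assuming Python is available there. The helper name `collect_debug` and the `<hostname>.<file>` naming scheme are illustrative choices, not part of any cluster tooling; only the two /proc paths come from the request above.

```python
import socket
from pathlib import Path

# The two per-node files requested above.
DEBUG_FILES = ["/proc/cluster/sm_debug", "/proc/cluster/dlm_debug"]


def collect_debug(out_dir, files=DEBUG_FILES, hostname=None):
    """Copy each debug file to out_dir as '<hostname>.<basename>'.

    Files that do not exist on this node are skipped. Returns the list of
    paths that were written.
    """
    host = hostname or socket.gethostname()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in files:
        src = Path(name)
        if not src.exists():
            continue
        dest = out / f"{host}.{src.name}"
        # errors="replace" keeps the copy going if the buffer holds odd bytes
        dest.write_text(src.read_text(errors="replace"))
        saved.append(dest)
    return saved
```

Running `collect_debug("/tmp/126758")` on each node would leave one pair of files per host, which can then be gathered in one place.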


Comment 3 Corey Marthaler 2004-06-30 15:32:44 UTC
/proc/cluster/sm_debug 
 
morph-01: 
tate 7 node 1 
01000002 uevent state 1 node 3 
01000002 uevent state 3 node 3 
01000002 add node 3 count 4 
01000002 uevent state 5 node 3 
01000002 uevent state 7 node 3 
01000002 uevent state 1 node 4 
01000002 uevent state 3 node 4 
01000002 add node 4 count 5 
 
morph-02: 
tate 7 node 1 
01000002 uevent state 1 node 3 
01000002 uevent state 3 node 3 
01000002 add node 3 count 4 
01000002 uevent state 5 node 3 
01000002 uevent state 7 node 3 
01000002 uevent state 1 node 4 
01000002 uevent state 3 node 4 
01000002 add node 4 count 5 
 
morph-03: 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
01000002 sevent state 3 
00000000 sevent state 1 
01000002 sevent state 3 
01000002 sevent state 5 
 
morph-04: 
 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
 
morph-05: 
000000 sevent state 1 
01000002 sevent state 3 
00000000 sevent state 1 
01000002 sevent state 3 
01000002 sevent state 5 
01000002 sevent state 7 
01000002 sevent state 9 
01000002 uevent state 1 node 4 
01000002 uevent state 3 node 4 
01000002 add node 4 count 5 
 
morph-06: 
 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
00000000 sevent state 1 
00000000 sevent state 3 
 
 
/proc/cluster/dlm_debug 
 
morph-01: 
clvmd rcom send 1 to 1 id 1124 
clvmd rcom send 1 to 1 id 1125 
clvmd rcom send 1 to 1 id 1126 
clvmd rcom send 1 to 1 id 1127 
clvmd rcom send 1 to 1 id 1128 
clvmd rcom send 1 to 1 id 1129 
clvmd rcom send 1 to 1 id 1130 
clvmd rcom send 1 to 1 id 1131 
clvmd rcom send 1 to 1 id 1132 
clvmd rcom send 1 to 1 id 1133 
clvmd rcom send 1 to 1 id 1134 
clvmd rcom send 1 to 1 id 1135 
clvmd rcom send 1 to 1 id 1136 
clvmd rcom send 1 to 1 id 1137 
clvmd rcom send 1 to 1 id 1138 
clvmd rcom send 1 to 1 id 1139 
clvmd rcom send 1 to 1 id 1140 
clvmd rcom send 1 to 1 id 1141 
clvmd rcom send 1 to 1 id 1142 
clvmd rcom send 1 to 1 id 1143 
clvmd rcom send 1 to 1 id 1144 
clvmd rcom send 1 to 1 id 1145 
clvmd rcom send 1 to 1 id 1146 
clvmd rcom send 1 to 1 id 1147 
clvmd rcom send 1 to 1 id 1148 
clvmd rcom send 1 to 1 id 1149 
clvmd rcom send 1 to 1 id 1150 
clvmd rcom send 1 to 1 id 1151 
clvmd rcom send 1 to 1 id 1152 
clvmd rcom send 1 to 1 id 1153 
clvmd rcom send 1 to 1 id 1154 
clvmd rcom send 1 to 1 id 1155 
clvmd rcom send 1 to 1 id 1156 
 
morph-02: 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
clvmd rcom status 4 to 1 
 
morph-03: 
clvmd move flags 0,1,0 ids 0,2,0 
clvmd move use event 2 
clvmd recover event 2 (first) 
clvmd add nodes 
clvmd rcom send 1 to 1 id 1 
clvmd rcom status 4 to 1 
clvmd rcom names len 8 to 1 id 33 
clvmd rcom send 1 to 1 id 2 
clvmd total nodes 5 
clvmd rebuild resource directory 
clvmd rcom send 2 to 1 id 3 
clvmd rcom names len 8 to 3 id 14 
clvmd rcom send 2 to 2 id 4 
clvmd rcom names len 8 to 6 id 30 
 
morph-04: 
nothing 
 
morph-05: 
clvmd rcom send 1 to 1 id 1108 
clvmd rcom send 1 to 1 id 1109 
clvmd rcom send 1 to 1 id 1110 
clvmd rcom send 1 to 1 id 1111 
clvmd rcom send 1 to 1 id 1112 
clvmd rcom send 1 to 1 id 1113 
clvmd rcom send 1 to 1 id 1114 
clvmd rcom send 1 to 1 id 1115 
clvmd rcom send 1 to 1 id 1116 
clvmd rcom send 1 to 1 id 1117 
clvmd rcom send 1 to 1 id 1118 
clvmd rcom send 1 to 1 id 1119 
clvmd rcom send 1 to 1 id 1120 
clvmd rcom send 1 to 1 id 1121 
clvmd rcom send 1 to 1 id 1122 
clvmd rcom send 1 to 1 id 1123 
clvmd rcom send 1 to 1 id 1124 
clvmd rcom send 1 to 1 id 1125 
clvmd rcom send 1 to 1 id 1126 
clvmd rcom send 1 to 1 id 1127 
clvmd rcom send 1 to 1 id 1128 
clvmd rcom send 1 to 1 id 1129 
clvmd rcom send 1 to 1 id 1130 
clvmd rcom send 1 to 1 id 1131 
clvmd rcom send 1 to 1 id 1132 
clvmd rcom send 1 to 1 id 1133 
clvmd rcom send 1 to 1 id 1134 
clvmd rcom send 1 to 1 id 1135 
clvmd rcom send 1 to 1 id 1136 
clvmd rcom send 1 to 1 id 1137 
clvmd rcom send 1 to 1 id 1138 
clvmd rcom send 1 to 1 id 1139 
clvmd rcom send 1 to 1 id 1140 
 
morph-06: 
clvmd rcom send 1 to 2 id 1149 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1150 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1151 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1152 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1153 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1154 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1155 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1156 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1157 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1158 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1159 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1160 
clvmd rcom status d to 6 
clvmd rcom status d to 3 
clvmd rcom send 1 to 2 id 1161 
 
/proc/cluster/nodes 
 
Node  Votes Exp Sts  Name 
   1    1    6   M   morph-06.lab.msp.redhat.com 
   2    1    6   M   morph-02.lab.msp.redhat.com 
   3    1    6   M   morph-05.lab.msp.redhat.com 
   4    1    6   M   morph-03.lab.msp.redhat.com 
   5    1    6   M   morph-04.lab.msp.redhat.com 
   6    1    6   M   morph-01.lab.msp.redhat.com 
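To read the node ids in the sm_debug and dlm_debug output against host names, the table above can be turned into an id-to-name map. This is a small sketch that assumes the five-column layout shown here (Node, Votes, Exp, Sts, Name); `parse_cluster_nodes` is an illustrative name.

```python
def parse_cluster_nodes(text):
    """Return {node_id: name} from /proc/cluster/nodes-style text.

    Assumes five whitespace-separated columns with a numeric node id
    first; the header row and blank lines are skipped.
    """
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].isdigit():
            nodes[int(fields[0])] = fields[4]
    return nodes


# The table from this comment.
SAMPLE = """\
Node  Votes Exp Sts  Name
   1    1    6   M   morph-06.lab.msp.redhat.com
   2    1    6   M   morph-02.lab.msp.redhat.com
   3    1    6   M   morph-05.lab.msp.redhat.com
   4    1    6   M   morph-03.lab.msp.redhat.com
   5    1    6   M   morph-04.lab.msp.redhat.com
   6    1    6   M   morph-01.lab.msp.redhat.com
"""

NODE_NAMES = parse_cluster_nodes(SAMPLE)
```

With this map, "add node 4" in the sm_debug output refers to `NODE_NAMES[4]`, i.e. morph-03.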
 

Comment 4 Dean Jansa 2004-07-14 15:01:02 UTC
As an added note, this bug prevents a node from re-joining the 
cluster.  After bringing down a node, I get stuck in the same 
"update/join" state when I start clvmd on the node I am attempting to 
bring back into the cluster. 

Comment 5 Christine Caulfield 2004-07-20 08:00:40 UTC
I think I've nailed this one down. It seems that if two nodes try to
connect to one other node at the same time, the second one gets
ignored so that node's messages never get received. I fixed this for
reads some time ago but (sadly) it didn't occur to me that it would
happen for connects too.
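For illustration only, here is a user-space sketch of the general pattern: when a listening socket signals readiness, drain every queued connection rather than handling just one. This is not the dlm code and not the actual fix. It assumes a non-blocking listener, and `accept_all_pending` is a hypothetical name.

```python
import select
import socket


def accept_all_pending(listener):
    """Accept every connection currently queued on a non-blocking listener.

    Accepting only once per wake-up could leave a second, simultaneous
    connection sitting in the queue; looping until accept() would block
    avoids that.
    """
    accepted = []
    while True:
        try:
            conn, _addr = listener.accept()
        except BlockingIOError:
            break  # queue is empty for now
        accepted.append(conn)
    return accepted
```

A caller would typically wait on the listener with `select.select([listener], [], [])` and then call `accept_all_pending(listener)` once per wake-up.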

Comment 6 Dean Jansa 2004-08-25 15:27:49 UTC
I have not been able to reproduce this after the fix.  Closing. 
 
However, comment #4 is still valid, but is covered by bug 128432. 

Comment 7 Kiersten (Kerri) Anderson 2004-11-16 19:02:57 UTC
Updating version to the right level in the defects.  Sorry for the storm.

