From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux)

Description of problem:
1. load the needed modules on all cluster nodes
2. start ccsd on all cluster nodes
3. cman_tool join on all cluster nodes (wait until all have actually joined)
4. fence_tool join on all cluster nodes
5. clvmd on all cluster nodes

Every so often when setting up a cluster, clvmd will not get to the "run" state. The nodes are stuck in the following states:

morph-01: update
morph-02: update
morph-03: join
morph-04: join
morph-05: update
morph-06: update

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes
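For anyone trying to reproduce this, the ordering in the steps above can be sketched as a small driver that finishes each step on every node before starting the next. This is only a sketch: `start_cluster`, `run` and `wait_for_members` are hypothetical names, the way commands reach each node (ssh or otherwise) is left to the caller, and step 1 is omitted because the report does not name the modules.

```python
# Steps 2-5 from the report. Step 1 (module loading) is left out because
# the report does not say which modules are needed.
STEPS = ["ccsd", "cman_tool join", "fence_tool join", "clvmd"]


def start_cluster(nodes, run, wait_for_members=lambda nodes: None, steps=STEPS):
    """Run each step on every node before moving on to the next step.

    `run(node, command)` is a caller-supplied (hypothetical) executor.
    `wait_for_members(nodes)` is a hypothetical hook for the "wait until
    all have actually joined" part of step 3; by default it does nothing.
    Returns the (node, command) pairs in the order they were issued.
    """
    issued = []
    for command in steps:
        for node in nodes:
            run(node, command)
            issued.append((node, command))
        if command == "cman_tool join":
            wait_for_members(nodes)
    return issued
```

With a runner that only records its arguments, `start_cluster(["morph-01", "morph-02"], run)` issues `ccsd` on both nodes, then `cman_tool join` on both, and so on.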
kernel logs of the nodes are in: /home/msp/cmarthal/pub/bugs/126758/kernellogs
Next time you hit this, could you collect the output of /proc/cluster/sm_debug and /proc/cluster/dlm_debug on each node, plus /proc/cluster/nodes from just one node to provide the name/id pairs? After getting that, if you have kdb installed, a backtrace of the dlm_recoverd and dlm_recvd threads would help too.
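If it helps, the collection could be scripted roughly like this and run locally on each node. It is a minimal sketch: the three paths are the ones named above, while `collect` and `read_file` are made-up helper names, and unreadable files are reported in place rather than raising.

```python
import socket

PROC_FILES = (
    "/proc/cluster/sm_debug",
    "/proc/cluster/dlm_debug",
    "/proc/cluster/nodes",
)


def read_file(path):
    """Return the file contents, or a marker line if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as exc:
        return "<unreadable: %s>\n" % exc


def collect(paths=PROC_FILES, reader=read_file, host=None):
    """Return one text blob with each file's contents labelled by path and host."""
    host = host or socket.gethostname()
    parts = []
    for path in paths:
        parts.append("%s %s:\n%s" % (path, host, reader(path)))
    return "\n".join(parts)
```

Running `print(collect())` on a node gives one labelled section per file, which can be pasted straight into the bug.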
/proc/cluster/sm_debug

morph-01:
tate 7 node 1
01000002 uevent state 1 node 3
01000002 uevent state 3 node 3
01000002 add node 3 count 4
01000002 uevent state 5 node 3
01000002 uevent state 7 node 3
01000002 uevent state 1 node 4
01000002 uevent state 3 node 4
01000002 add node 4 count 5

morph-02:
tate 7 node 1
01000002 uevent state 1 node 3
01000002 uevent state 3 node 3
01000002 add node 3 count 4
01000002 uevent state 5 node 3
01000002 uevent state 7 node 3
01000002 uevent state 1 node 4
01000002 uevent state 3 node 4
01000002 add node 4 count 5

morph-03:
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
01000002 sevent state 3
00000000 sevent state 1
01000002 sevent state 3
01000002 sevent state 5

morph-04:
sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3

morph-05:
000000 sevent state 1
01000002 sevent state 3
00000000 sevent state 1
01000002 sevent state 3
01000002 sevent state 5
01000002 sevent state 7
01000002 sevent state 9
01000002 uevent state 1 node 4
01000002 uevent state 3 node 4
01000002 add node 4 count 5

morph-06:
sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3
00000000 sevent state 1
00000000 sevent state 3

/proc/cluster/dlm_debug

morph-01:
clvmd rcom send 1 to 1 id 1124
clvmd rcom send 1 to 1 id 1125
clvmd rcom send 1 to 1 id 1126
clvmd rcom send 1 to 1 id 1127
clvmd rcom send 1 to 1 id 1128
clvmd rcom send 1 to 1 id 1129
clvmd rcom send 1 to 1 id 1130
clvmd rcom send 1 to 1 id 1131
clvmd rcom send 1 to 1 id 1132
clvmd rcom send 1 to 1 id 1133
clvmd rcom send 1 to 1 id 1134
clvmd rcom send 1 to 1 id 1135
clvmd rcom send 1 to 1 id 1136
clvmd rcom send 1 to 1 id 1137
clvmd rcom send 1 to 1 id 1138
clvmd rcom send 1 to 1 id 1139
clvmd rcom send 1 to 1 id 1140
clvmd rcom send 1 to 1 id 1141
clvmd rcom send 1 to 1 id 1142
clvmd rcom send 1 to 1 id 1143
clvmd rcom send 1 to 1 id 1144
clvmd rcom send 1 to 1 id 1145
clvmd rcom send 1 to 1 id 1146
clvmd rcom send 1 to 1 id 1147
clvmd rcom send 1 to 1 id 1148
clvmd rcom send 1 to 1 id 1149
clvmd rcom send 1 to 1 id 1150
clvmd rcom send 1 to 1 id 1151
clvmd rcom send 1 to 1 id 1152
clvmd rcom send 1 to 1 id 1153
clvmd rcom send 1 to 1 id 1154
clvmd rcom send 1 to 1 id 1155
clvmd rcom send 1 to 1 id 1156

morph-02:
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1
clvmd rcom status 4 to 1

morph-03:
clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes
clvmd rcom send 1 to 1 id 1
clvmd rcom status 4 to 1
clvmd rcom names len 8 to 1 id 33
clvmd rcom send 1 to 1 id 2
clvmd total nodes 5
clvmd rebuild resource directory
clvmd rcom send 2 to 1 id 3
clvmd rcom names len 8 to 3 id 14
clvmd rcom send 2 to 2 id 4
clvmd rcom names len 8 to 6 id 30

morph-04:
nothing

morph-05:
clvmd rcom send 1 to 1 id 1108
clvmd rcom send 1 to 1 id 1109
clvmd rcom send 1 to 1 id 1110
clvmd rcom send 1 to 1 id 1111
clvmd rcom send 1 to 1 id 1112
clvmd rcom send 1 to 1 id 1113
clvmd rcom send 1 to 1 id 1114
clvmd rcom send 1 to 1 id 1115
clvmd rcom send 1 to 1 id 1116
clvmd rcom send 1 to 1 id 1117
clvmd rcom send 1 to 1 id 1118
clvmd rcom send 1 to 1 id 1119
clvmd rcom send 1 to 1 id 1120
clvmd rcom send 1 to 1 id 1121
clvmd rcom send 1 to 1 id 1122
clvmd rcom send 1 to 1 id 1123
clvmd rcom send 1 to 1 id 1124
clvmd rcom send 1 to 1 id 1125
clvmd rcom send 1 to 1 id 1126
clvmd rcom send 1 to 1 id 1127
clvmd rcom send 1 to 1 id 1128
clvmd rcom send 1 to 1 id 1129
clvmd rcom send 1 to 1 id 1130
clvmd rcom send 1 to 1 id 1131
clvmd rcom send 1 to 1 id 1132
clvmd rcom send 1 to 1 id 1133
clvmd rcom send 1 to 1 id 1134
clvmd rcom send 1 to 1 id 1135
clvmd rcom send 1 to 1 id 1136
clvmd rcom send 1 to 1 id 1137
clvmd rcom send 1 to 1 id 1138
clvmd rcom send 1 to 1 id 1139
clvmd rcom send 1 to 1 id 1140

morph-06:
clvmd rcom send 1 to 2 id 1149
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1150
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1151
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1152
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1153
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1154
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1155
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1156
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1157
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1158
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1159
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1160
clvmd rcom status d to 6
clvmd rcom status d to 3
clvmd rcom send 1 to 2 id 1161

/proc/cluster/nodes

Node  Votes Exp Sts  Name
   1    1    6   M   morph-06.lab.msp.redhat.com
   2    1    6   M   morph-02.lab.msp.redhat.com
   3    1    6   M   morph-05.lab.msp.redhat.com
   4    1    6   M   morph-03.lab.msp.redhat.com
   5    1    6   M   morph-04.lab.msp.redhat.com
   6    1    6   M   morph-01.lab.msp.redhat.com
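To make the name/id pairs easier to use when reading the dumps, the /proc/cluster/nodes listing above can be turned into a lookup table. This is a rough sketch that assumes the five-column layout shown here (Node, Votes, Exp, Sts, Name); `parse_nodes` is a made-up helper name. Whether the numbers after "to" in the dlm_debug lines are node ids is also an assumption, since nothing in this bug says so explicitly.

```python
def parse_nodes(text):
    """Map node id -> node name from a /proc/cluster/nodes style listing."""
    names = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 5-column data row.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        names[int(fields[0])] = fields[4]
    return names


# The listing from this bug, used as sample input.
SAMPLE = """\
Node Votes Exp Sts Name
1 1 6 M morph-06.lab.msp.redhat.com
2 1 6 M morph-02.lab.msp.redhat.com
3 1 6 M morph-05.lab.msp.redhat.com
4 1 6 M morph-03.lab.msp.redhat.com
5 1 6 M morph-04.lab.msp.redhat.com
6 1 6 M morph-01.lab.msp.redhat.com
"""

NODE_NAMES = parse_nodes(SAMPLE)
```

Under that assumption, `NODE_NAMES[1]` shows that the repeated "to 1" entries on morph-01, morph-02 and morph-05 would all refer to morph-06.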
As an added note, this bug also prevents a node from re-joining the cluster. After bringing down a node, I get stuck in the same "update/join" state when I start clvmd on the node I am attempting to bring back into the cluster.
I think I've nailed this one down. It seems that if two nodes try to connect to the same node at the same time, the second connection gets ignored, so that node's messages never get received. I fixed this for reads some time ago but (sadly) it didn't occur to me that it would happen for connects too.
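For readers unfamiliar with this class of bug, here is a user-space illustration of how a second simultaneous connection can be left unhandled. It is not the DLM code and not the actual fix. It assumes an event-driven listener that gets a single wakeup for several pending connections, and all names (`accept_one`, `accept_all`, `demo`) are invented for the example.

```python
import socket
import time


def accept_one(listener):
    """Problem pattern: handle only one pending connection per wakeup."""
    try:
        conn, _addr = listener.accept()
    except BlockingIOError:
        return []
    return [conn]


def accept_all(listener):
    """Drain every pending connection before going back to sleep."""
    conns = []
    while True:
        try:
            conn, _addr = listener.accept()
        except BlockingIOError:
            return conns
        conns.append(conn)


def demo(accept):
    """Connect two clients at once, then service a single 'wakeup'."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)
    listener.setblocking(False)
    port = listener.getsockname()[1]
    clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(2)]
    time.sleep(0.1)  # let both handshakes reach the accept queue
    accepted = accept(listener)
    count = len(accepted)
    for sock in accepted + clients + [listener]:
        sock.close()
    return count
```

Here `demo(accept_one)` services only one of the two clients, while `demo(accept_all)` picks up both. If nothing triggers a second wakeup, the remaining client in the first case stays connected but is never read from.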
I have not been able to reproduce this after the fix. Closing. However, comment #4 is still valid, but is covered by bug 128432.
Updating version to the right level in the defects. Sorry for the storm.