Description of problem:
After nodes went down and an attempt was made to bring them back into the cluster, the nodes left up have their services stuck in the recovery state. morph-01 and morph-05 were the nodes which were shot.

[root@morph-01 root]# cat /proc/cluster/services
Service          Name       GID LID State     Code
Fence Domain:    "default"    0   2 join      S-1,80,6  []
DLM Lock Space:  "clvmd"      0   3 join      S-1,80,6  []

[root@morph-02 root]# cat /proc/cluster/services
Service          Name       GID LID State     Code
Fence Domain:    "default"    1   2 recover 4 -  [6 4 3 1]
DLM Lock Space:  "clvmd"      2   3 recover 0 -  [4 3 1 6]
DLM Lock Space:  "foobar0"    3   4 recover 0 -  [6 4 3 1]
DLM Lock Space:  "foobar1"    5   6 recover 0 -  [6 4 3 1]
DLM Lock Space:  "foobar2"    7   8 recover 0 -  [6 4 3 1]
GFS Mount Group: "foobar0"    4   5 recover 0 -  [6 4 3 1]
GFS Mount Group: "foobar1"    6   7 recover 0 -  [6 4 3 1]
GFS Mount Group: "foobar2"    8   9 recover 0 -  [6 4 3 1]

[root@morph-03 root]# cat /proc/cluster/services
Service          Name       GID LID State     Code
Fence Domain:    "default"    1   2 recover 4 -  [4 6 3 1]
DLM Lock Space:  "clvmd"      2   3 recover 0 -  [4 1 3 6]
DLM Lock Space:  "foobar0"    3   4 recover 0 -  [4 6 3 1]
DLM Lock Space:  "foobar1"    5   6 recover 0 -  [4 6 3 1]
DLM Lock Space:  "foobar2"    7   8 recover 0 -  [4 6 3 1]
GFS Mount Group: "foobar0"    4   5 recover 0 -  [4 6 3 1]
GFS Mount Group: "foobar1"    6   7 recover 0 -  [4 6 3 1]
GFS Mount Group: "foobar2"    8   9 recover 0 -  [4 6 3 1]

[root@morph-04 root]# cat /proc/cluster/services
Service          Name       GID LID State     Code
Fence Domain:    "default"    1   2 recover 4 -  [3 4 6 1]
DLM Lock Space:  "clvmd"      2   3 recover 0 -  [1 3 4 6]
DLM Lock Space:  "foobar0"    3   4 recover 0 -  [3 4 6 1]
DLM Lock Space:  "foobar1"    5   6 recover 0 -  [3 4 6 1]
DLM Lock Space:  "foobar2"    7   8 recover 0 -  [3 4 6 1]
GFS Mount Group: "foobar0"    4   5 recover 0 -  [3 4 6 1]
GFS Mount Group: "foobar1"    6   7 recover 0 -  [3 4 6 1]
GFS Mount Group: "foobar2"    8   9 recover 0 -  [3 4 6 1]

[root@morph-05 root]# cat /proc/cluster/services
Service          Name       GID LID State     Code
Fence Domain:    "default"    0   2 join      S-1,80,6  []
DLM Lock Space:  "clvmd"      0   3 join      S-1,280,6  []

[root@morph-06 root]# cat /proc/cluster/services
Service          Name       GID LID State     Code
Fence Domain:    "default"    1   2 recover 2 -  [1 3 4 6]
DLM Lock Space:  "clvmd"      2   3 recover 0 -  [1 4 3 6]
DLM Lock Space:  "foobar0"    3   4 recover 0 -  [1 3 4 6]
DLM Lock Space:  "foobar1"    5   6 recover 0 -  [1 3 4 6]
DLM Lock Space:  "foobar2"    7   8 recover 0 -  [1 3 4 6]
GFS Mount Group: "foobar0"    4   5 recover 0 -  [1 3 4 6]
GFS Mount Group: "foobar1"    6   7 recover 0 -  [1 3 4 6]
GFS Mount Group: "foobar2"    8   9 recover 0 -  [1 3 4 6]

How reproducible:
Sometimes
morph-06 is in recover state 2 for the fence domain, which means it's waiting for fenced or the agent to complete the fencing operation. So the fencing operation is stuck for some reason, maybe the same reason as bz 127021? The other nodes, in recover state 4, are waiting for morph-06 to finish before doing anything else. morph-01 and morph-05 are trying to join the fence domain but must wait until the fence domain completes recovery.
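To compare nodes quickly, the recover states discussed above can be pulled out of the services listing with a small filter. This is only a sketch: the function name recovering_services is made up for illustration, and it assumes each service line carries the quoted name followed by GID, LID, the word "recover" and the state number, as in the output in the report.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cluster tools): print every service
# that is in the "recover" state, reading /proc/cluster/services-style text
# on stdin. Assumes service lines look like:
#   DLM Lock Space:  "clvmd"   2   3 recover 0 -
recovering_services() {
    awk -F'"' '
        NF >= 3 {
            # $1 = service type, $2 = service name, $3 = GID LID State ...
            n = split($3, f, " ")
            if (n >= 4 && f[3] == "recover") {
                type = $1
                sub(/:[ \t]*$/, "", type)   # drop trailing colon and padding
                printf "%s %s recover=%s\n", type, $2, f[4]
            }
        }'
}

# Usage on a cluster node (path taken from the report):
#   recovering_services < /proc/cluster/services
```

On morph-06 this would be expected to show the fence domain at recover=2 and the other services at recover=0, which matches the reading given in the comment above.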
These nodes have your patch from yesterday, which I thought fixes 127021; at least I don't see the random fencing during startup anymore. But I don't see a fence attempt during recovery either.
Do you still see this? Everything in the original report looks OK, as if a fencing operation is in progress on morph-06.
I do still see this problem of all services being stuck in the recovery state when one or more nodes are taken down and then brought back up. Also, I still never see any fence messages or attempts, which is apparently why this problem occurs.
Need to have the perl Net-Telnet rpm installed. I forgot there are no checks/warnings for perl to be installed since we are building everything.
Updating version to the right level in the defects. Sorry for the storm.