I run into this every couple of times I try to start 'clvmd' on all machines at once. I have three x86 machines. When they come up, I issue a 'clvmd' command on all of them at the same time. One succeeds; the other two hang. (I issue the command by hand rather than running the init scripts because I repeatedly hit this bug with the init scripts in the past.)

On neo-04, the command hangs and reports the following on the console:

  dlm: connect from non cluster node

Other state includes:

  [root@neo-04 ~]# cat /proc/cluster/services
  Service          Name          GID LID State     Code
  Fence Domain:    "default"       5   2 run       -
  [4 6 5]

  DLM Lock Space:  "clvmd"        10   3 join      S-6,20,2
  [4 5]

  [root@neo-04 ~]# cat /proc/cluster/nodes
  Node  Votes Exp Sts  Name
     4    1    3   M   neo-04
     5    1    3   M   neo-05
     6    1    3   M   neo-06

On neo-05, the command succeeds and reports the following on the console:

  dlm: connect from non cluster node

Other state includes:

  [root@neo-05 ~]# cat /proc/cluster/services
  Service          Name          GID LID State     Code
  Fence Domain:    "default"       5   2 run       -
  [4 5 6]

  DLM Lock Space:  "clvmd"        10   3 update    U-4,1,4
  [5 4]

  [root@neo-05 ~]# cat /proc/cluster/nodes
  Node  Votes Exp Sts  Name
     4    1    3   M   neo-04
     5    1    3   M   neo-05
     6    1    3   M   neo-06

On neo-06, the command hangs:

  [root@neo-06 ~]# cat /proc/cluster/services
  Service          Name          GID LID State     Code
  Fence Domain:    "default"       5   2 run       -
  [4 6 5]

  DLM Lock Space:  "clvmd"         0   3 join      S-1,80,3
  []

  [root@neo-06 ~]# cat /proc/cluster/nodes
  Node  Votes Exp Sts  Name
     4    1    3   M   neo-04
     5    1    3   M   neo-05
     6    1    3   M   neo-06
The usual cause of this is misconfigured IP settings. A common case is two interfaces on the same Ethernet segment: Linux sends the DLM's outgoing connection out of the "wrong" one, so the receiving DLM sees a source address that does not belong to any node it knows about. Check the ifconfig and route output on each node to make sure they are sane.
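To illustrate that failure mode, here is a minimal userspace sketch (not dlm-kernel code) of the kind of lookup that ends in "connect from non cluster node": the receiver maps the source address of an incoming connection back to a node ID and rejects anything it does not recognise. The struct cluster_node type, the nodeid_from_source() function, and the 10.0.0.x / 192.168.1.x addresses are all hypothetical; only node IDs 4-6 come from the output in this bug.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>

/* Hypothetical membership table: the address the cluster manager
 * knows each node by. Not the real dlm-kernel data structure. */
struct cluster_node {
    int nodeid;
    struct in_addr addr;
};

/* Return the node ID whose configured address matches the source
 * address of an incoming connection, or -1 if none matches.
 * A member that sends from an unexpected interface lands in the -1
 * case, which is where a "non cluster node" complaint would come from. */
int nodeid_from_source(const struct in_addr *src,
                       const struct cluster_node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].addr.s_addr == src->s_addr)
            return nodes[i].nodeid;
    }
    return -1;
}
```

For example, if neo-05 is known as 10.0.0.5 but its connection leaves through a second interface addressed 192.168.1.5, the lookup fails and the connection is refused even though neo-05 is a cluster member.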
You're right, the settings didn't make any sense at all. I guess I was just trusting DHCP... Closing the bug. I'll reopen it if I come across this again.
FWIW, bug 338511 is related to this bug; in fact, it's the same bug. There's a patch attached to 338511 against dlm-kernel that calls sock->ops->bind() in lowcomms.c to set the source IP address before making a connection. There's probably a reason this wasn't done.
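For reference, the userspace analogue of that approach is to bind() the socket to a chosen local address before connect(), so the source IP no longer depends on which interface the routing table picks. This is a sketch using POSIX sockets, not the patch itself: connect_from() is a hypothetical helper, and the kernel code in lowcomms.c would go through sock->ops->bind() rather than these calls.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP connection to dst_ip:port with the
 * source address pinned to src_ip. Returns the socket fd, or -1. */
int connect_from(const char *src_ip, const char *dst_ip, uint16_t port)
{
    struct sockaddr_in src, dst;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = htons(0); /* let the kernel choose an ephemeral port */
    if (inet_pton(AF_INET, src_ip, &src.sin_addr) != 1)
        goto fail;

    /* Bind before connecting so the source address is the one we chose,
     * not whatever the outgoing interface's address happens to be. */
    if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0)
        goto fail;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
        goto fail;

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

Binding with port 0 pins only the address and leaves port selection to the kernel. One possible trade-off: if the bound address lives on an interface that is down, the connection simply fails instead of going out some other route.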