Description of problem:
This isn't specifically CMAN, but rather the daemons that CMAN starts. I have a two-node cluster. If I bring the nodes up with the cluster services stopped and then run "service cman start" on both nodes at the same time, they both hang at:

[root@rh5cluster1 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons...

fenced has started, so I'm assuming it's stuck at the dlm_controld stage; dlm_controld isn't showing up in ps. The only messages I get are these:

Sep 19 15:44:50 rh5cluster1 ccsd[2718]: Initial status:: Quorate
Sep 19 15:49:18 rh5cluster1 kernel: DLM (built Aug 30 2006 18:19:57) installed
Sep 19 15:49:18 rh5cluster1 kernel: GFS2 (built Aug 30 2006 18:20:27) installed
Sep 19 15:49:18 rh5cluster1 kernel: Lock_DLM (built Aug 30 2006 18:20:36) installed
Sep 19 15:49:18 rh5cluster1 kernel: dlm: no local IP address has been set
Sep 19 15:49:18 rh5cluster1 kernel: dlm: cannot start dlm lowcomms -22

and

Sep 19 15:44:50 rh5cluster2 openais[2715]: [TOTEM] entering OPERATIONAL state.
Sep 19 15:44:50 rh5cluster2 openais[2715]: [CLM ] got nodejoin message 10.10.1.12
Sep 19 15:44:50 rh5cluster2 openais[2715]: [CLM ] got nodejoin message 10.10.1.13
Sep 19 15:44:50 rh5cluster2 kernel: DLM (built Aug 30 2006 18:19:57) installed
Sep 19 15:44:50 rh5cluster2 kernel: GFS2 (built Aug 30 2006 18:20:27) installed
Sep 19 15:44:50 rh5cluster2 kernel: Lock_DLM (built Aug 30 2006 18:20:36) installed

I will look into it more tomorrow.

Version-Release number of selected component (if applicable):
cman-2.0.16

How reproducible:
Every time

Steps to Reproduce:
1. Bring up the nodes without the cluster services enabled.
2. Open an ssh session to each node and type "service cman start", but don't hit enter.
3. Hit enter in both terminals as close together as possible.

Actual results:
The script hangs at "Starting daemons".

Expected results:
It shouldn't hang.

Additional info:
I'm using the newest packages in brew, cman-2.0.16.
Sep 19 15:49:18 rh5cluster1 kernel: dlm: no local IP address has been set
Sep 19 15:49:18 rh5cluster1 kernel: dlm: cannot start dlm lowcomms -22

Those are obviously the key messages. If dlm_controld isn't running, that would explain why the DLM hasn't been configured - perhaps it crashed? A debug log from dlm_controld would be really helpful here if you can get one.
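For context on those two messages: dlm_controld feeds node addresses to the kernel DLM through configfs, and lowcomms refuses to start (-22 is -EINVAL) until a comm entry has been marked local. Below is a minimal, hypothetical sketch of that configuration step, based on the dlm configfs layout under /sys/kernel/config/dlm; it is not dlm_controld's actual code, and error handling is trimmed.

/*
 * Sketch (not dlm_controld's real code): configuring the kernel
 * DLM's local comms address through configfs.  The kernel error
 * "dlm: no local IP address has been set" fires when no comm
 * entry has been marked local before lowcomms starts.
 */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

#define COMMS_DIR "/sys/kernel/config/dlm/cluster/comms"

static int write_file(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY);
	ssize_t rv;

	if (fd < 0)
		return -errno;
	rv = write(fd, buf, len);
	close(fd);
	return (rv == (ssize_t)len) ? 0 : -EIO;
}

/* Tell the kernel "nodeid is me, reachable at ip". */
int set_local_comm(int nodeid, const char *ip)
{
	char path[PATH_MAX], val[32];
	struct sockaddr_storage ss;
	struct sockaddr_in *sin = (struct sockaddr_in *)&ss;

	/* Create the comm directory for this node. */
	snprintf(path, sizeof(path), "%s/%d", COMMS_DIR, nodeid);
	if (mkdir(path, 0755) < 0 && errno != EEXIST)
		return -errno;

	snprintf(path, sizeof(path), "%s/%d/nodeid", COMMS_DIR, nodeid);
	snprintf(val, sizeof(val), "%d", nodeid);
	if (write_file(path, val, strlen(val)))
		return -1;

	/* The addr file takes a binary struct sockaddr_storage. */
	memset(&ss, 0, sizeof(ss));
	sin->sin_family = AF_INET;
	inet_pton(AF_INET, ip, &sin->sin_addr);
	snprintf(path, sizeof(path), "%s/%d/addr", COMMS_DIR, nodeid);
	if (write_file(path, &ss, sizeof(ss)))
		return -1;

	/* Mark this comm entry as the local node. */
	snprintf(path, sizeof(path), "%s/%d/local", COMMS_DIR, nodeid);
	if (write_file(path, "1", 1))
		return -1;

	return 0;
}

If dlm_controld dies (or never runs) before the "local" write happens, lowcomms has no local address and fails with exactly the -22 seen above.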
Oh, and it's also worth checking whether configfs is mounted. The times I have seen this message, configfs had not been mounted for some reason.
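A quick way to check that programmatically, sketched here by scanning /proc/mounts (this is an illustration, not taken from the cman init script):

/* Sketch: is a filesystem of type "configfs" mounted anywhere? */
#include <mntent.h>
#include <stdio.h>
#include <string.h>

int configfs_mounted(void)
{
	FILE *f = setmntent("/proc/mounts", "r");
	struct mntent *m;
	int found = 0;

	if (!f)
		return 0;
	while ((m = getmntent(f))) {
		if (!strcmp(m->mnt_type, "configfs")) {
			found = 1;
			break;
		}
	}
	endmntent(f);
	return found;
}

By convention configfs goes on /sys/kernel/config, which is where the init script's "Mounting configfs" step puts it.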
I saw something possibly similar to comment #1 today, where a node was added to the DLM members list before dlm_controld knew its IP address. The DLM kicked out these errors:

dlm: Initiating association with node 13
dlm: no address for nodeid 13

Is it possible there's a race here - the cman event callback arriving after dlm_controld has decided to add the new node?
Devel ACK for RHEL 5.0.0 Beta 2
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in release.
This is slightly hacky and I can't seem to reproduce it any more, but it should fix the problem. Basically, if a lockspace contains a node that dlm_controld doesn't know about, it re-reads the cman nodes list.

Checking in action.c;
/cvs/cluster/cluster/group/dlm_controld/action.c,v  <--  action.c
new revision: 1.7; previous revision: 1.6
done
Checking in dlm_daemon.h;
/cvs/cluster/cluster/group/dlm_controld/dlm_daemon.h,v  <--  dlm_daemon.h
new revision: 1.4; previous revision: 1.3
done
Checking in member_cman.c;
/cvs/cluster/cluster/group/dlm_controld/member_cman.c,v  <--  member_cman.c
new revision: 1.3; previous revision: 1.2
done
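For readers following along, here is a rough sketch of the shape of that fix, assuming libcman's cman_init()/cman_get_nodes(). The function names, MAX_NODES constant, and caching scheme are illustrative; this is not the actual action.c/member_cman.c change.

/*
 * Sketch of the fix described above: when a lockspace change lists
 * a nodeid we have no entry for, re-read the cman node list before
 * giving up, closing the window where the lockspace sees a node
 * before our cman event callback does.
 */
#include <stdio.h>
#include <libcman.h>

#define MAX_NODES 256	/* hypothetical cap */

static cman_handle_t ch;
static cman_node_t   cman_nodes[MAX_NODES];
static int           cman_node_count;

/* Called once at daemon startup. */
int setup_cman(void)
{
	ch = cman_init(NULL);
	return ch ? 0 : -1;
}

/* Refresh our cached copy of cman's member list. */
static int update_cluster_nodes(void)
{
	cman_node_count = 0;
	return cman_get_nodes(ch, MAX_NODES, &cman_node_count, cman_nodes);
}

static int node_is_known(int nodeid)
{
	int i;

	for (i = 0; i < cman_node_count; i++)
		if (cman_nodes[i].cn_nodeid == nodeid)
			return 1;
	return 0;
}

/* Called for each member of a lockspace change. */
int check_lockspace_member(int nodeid)
{
	if (!node_is_known(nodeid)) {
		/* Race: the lockspace saw this node before our cman
		   event callback did.  Re-read the node list rather
		   than configuring the DLM with a missing address. */
		if (update_cluster_nodes() < 0)
			return -1;
		if (!node_is_known(nodeid))
			return -1;	/* genuinely unknown */
	}
	return 0;
}

The re-read on demand is what makes it "slightly hacky": rather than guaranteeing event ordering, it simply refreshes the cache at the point of use.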
Moving all RHCS v5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.