Created attachment 394985
Logs and cluster.conf

Description of problem:
Continuous reporting of 'daemon cpg_join error retrying' on the second node to enter the cluster.

Version-Release number of selected component (if applicable):
Linux Kernel 2.6.38.2
dlm_controld 3.0.6
corosync 1.2.0
fenced 3.0.6
gfs_controld 3.0.6
rgmanager 3.0.6
cman 3.0.6

Actual results:
In the syslog of the second node to enter the cluster, we see the following many times:

Feb 18 13:27:15 spare corosync[2109]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 13:27:15 spare corosync[2109]: [CLM ] New Configuration:
Feb 18 13:27:15 spare corosync[2109]: [CLM ] #011r(0) ip(10.10.10.1)
Feb 18 13:27:15 spare corosync[2109]: [CLM ] #011r(0) ip(10.10.10.2)
Feb 18 13:27:15 spare corosync[2109]: [CLM ] Members Left:
Feb 18 13:27:15 spare corosync[2109]: [CLM ] Members Joined:
Feb 18 13:27:15 spare corosync[2109]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 13:27:15 spare corosync[2109]: [CLM ] New Configuration:
Feb 18 13:27:15 spare corosync[2109]: [CLM ] #011r(0) ip(10.10.10.1)
Feb 18 13:27:15 spare corosync[2109]: [CLM ] #011r(0) ip(10.10.10.2)
Feb 18 13:27:15 spare corosync[2109]: [CLM ] Members Left:
Feb 18 13:27:15 spare corosync[2109]: [CLM ] Members Joined:
Feb 18 13:27:15 spare corosync[2109]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 18 13:27:23 spare fenced[2162]: daemon cpg_join error retrying
Feb 18 13:27:23 spare gfs_controld[2196]: daemon cpg_join error retrying
Feb 18 13:27:23 spare dlm_controld[2181]: daemon cpg_join error retrying
Feb 18 13:27:25 spare corosync[2109]: [TOTEM ] A processor failed, forming new configuration.
which version of openais?
(In reply to comment #1)
> which version of openais?

openais version 1.1.1
This problem has not been reported. It is possible your switch configuration is not correct, or there is a bug in one of the software components.

To help us isolate the root of the issue, I'd ask that you run corosync directly using the corosync init script. This requires that the corosync.conf file is configured. The file is documented in the corosync.conf man page (man corosync.conf).

If your network is working properly, you will see:

Feb 18 16:48:15 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 18 16:48:15 corosync [MAIN ] Completed service synchronization, ready to provide service.

This last line is very important: it indicates that synchronization completed properly. You should see it after starting the 2nd node in all cases.

You will never see:

Feb 18 13:27:25 spare corosync[2109]: [TOTEM ] A processor failed, forming new configuration.
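For anyone repeating this check on a long syslog, the comparison above can be scripted. The following is only a rough sketch, not part of corosync or cman: the function names `summarize_corosync_log` and `looks_healthy` are made up for illustration, the marker strings are copied from the log lines quoted in this report, and the log path is whatever you pass on the command line.

```python
import sys

# Marker strings copied verbatim from the log lines quoted in this report.
SYNC_OK = "Completed service synchronization, ready to provide service."
PROCESSOR_FAILED = "A processor failed, forming new configuration."
CPG_RETRY = "daemon cpg_join error retrying"


def summarize_corosync_log(lines):
    """Count how often each marker appears in an iterable of log lines.

    Hypothetical helper: plain substring matching, no parsing of timestamps.
    """
    counts = {"sync_ok": 0, "processor_failed": 0, "cpg_join_retry": 0}
    for line in lines:
        if SYNC_OK in line:
            counts["sync_ok"] += 1
        if PROCESSOR_FAILED in line:
            counts["processor_failed"] += 1
        if CPG_RETRY in line:
            counts["cpg_join_retry"] += 1
    return counts


def looks_healthy(counts):
    """True if synchronization completed and no failure markers were seen."""
    return (
        counts["sync_ok"] > 0
        and counts["processor_failed"] == 0
        and counts["cpg_join_retry"] == 0
    )


if __name__ == "__main__":
    # The log file path is an assumption; pass the file your syslog writes to.
    with open(sys.argv[1], errors="replace") as log_file:
        result = summarize_corosync_log(log_file)
    print(result)
    print("healthy" if looks_healthy(result) else "membership problems seen")
```

Substring matching is used on purpose so the script works with or without the hostname and PID prefix that syslog adds. It only counts markers over the whole file, so a log that also covers earlier, unrelated restarts may need to be trimmed first.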
(In reply to comment #3)
> This problem has not been reported. It is possible your switch configuration
> is not correct,

I tried again with RHCS version 3.0.4 (which I know works in my environment) and found the same problem. I asked the network administrator whether he had made any changes in the past few days, and he answered that he had changed the switch interconnect, which now uses MSTP and redundant routes to the root switch. When he rolled that change back, everything worked in both versions.

We are going to investigate this a bit more. My apologies for the inconvenience, and thanks a lot for your help.
Any details you can provide about the brand/model of switch you are using and what exactly you changed to make it work would be useful info for others that run into switch issues. Regards -stee