Bug 566573 - Continuous reporting of 'daemon cpg_join error retrying'
Summary: Continuous reporting of 'daemon cpg_join error retrying'
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: corosync
Version: 12
Hardware: All
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ---
Assignee: Steven Dake
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-02-18 20:56 UTC by Ernesto Rodriguez Reina
Modified: 2016-04-26 15:00 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2010-02-19 02:48:08 UTC
Type: ---
Embargoed:


Attachments:
Logs and cluster.conf (82.55 KB, application/x-gzip)
2010-02-18 20:56 UTC, Ernesto Rodriguez Reina

Description Ernesto Rodriguez Reina 2010-02-18 20:56:18 UTC
Created attachment 394985
Logs and cluster.conf

Description of problem:

Continuous reporting of 'daemon cpg_join error retrying' on the second node to join the cluster.

Version-Release number of selected component (if applicable):

Linux Kernel 2.6.38.2
dlm_controld 3.0.6
corosync 1.2.0
fenced 3.0.6
gfs_controld 3.0.6
rgmanager 3.0.6
cman 3.0.6

Actual results:
The syslog of the second node to join the cluster repeatedly shows:

Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] CLM CONFIGURATION CHANGE
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] New Configuration:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] #011r(0) ip(10.10.10.1)
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] #011r(0) ip(10.10.10.2)
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] Members Left:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] Members Joined:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] CLM CONFIGURATION CHANGE
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] New Configuration:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] #011r(0) ip(10.10.10.1)
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] #011r(0) ip(10.10.10.2)
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] Members Left:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] Members Joined:
Feb 18 13:27:15 spare corosync[2109]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 18 13:27:23 spare fenced[2162]: daemon cpg_join error retrying
Feb 18 13:27:23 spare gfs_controld[2196]: daemon cpg_join error retrying
Feb 18 13:27:23 spare dlm_controld[2181]: daemon cpg_join error retrying
Feb 18 13:27:25 spare corosync[2109]:   [TOTEM ] A processor failed, forming new configuration.

Comment 1 Steven Dake 2010-02-18 23:15:23 UTC
which version of openais?

Comment 2 Ernesto Rodriguez Reina 2010-02-18 23:26:51 UTC
(In reply to comment #1)
> which version of openais?    

openais version 1.1.1

Comment 3 Steven Dake 2010-02-18 23:58:58 UTC
This problem has not been reported.  It is possible your switch configuration is not correct, or there is a bug in one of the software components.  To help us isolate the root of the issue, I'd ask that you run corosync directly using the corosync init script.  This requires a configured corosync.conf file, which is documented in the corosync.conf man page (man corosync.conf).

If your network is working properly, you will see the following after starting the 2nd node, in all cases:
Feb 18 16:48:15 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 18 16:48:15 corosync [MAIN  ] Completed service synchronization, ready to provide service.

The last line is very important: it indicates that synchronization completed properly.

You should never see:
Feb 18 13:27:25 spare corosync[2109]:   [TOTEM ] A processor failed, forming new configuration.
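
For anyone who wants to check a log for the messages described above, here is a minimal Python sketch. It is not part of corosync or the cluster tools. The function name summarize and the result keys are made up for illustration, and it assumes each syslog entry sits on a single line, as it does in the log file itself rather than in the wrapped text pasted into this bug.

```python
import re
from collections import Counter

# Substrings taken from the messages quoted in this report.
SYNC_OK = "Completed service synchronization"
PROC_FAILED = "A processor failed"
# Matches e.g. "fenced[2162]: daemon cpg_join error retrying" and captures the daemon name.
CPG_RETRY = re.compile(r"(\w+)\[\d+\]: daemon cpg_join error retrying")


def summarize(lines):
    """Count the healthy and unhealthy messages in an iterable of syslog lines.

    Hypothetical helper for illustration only.
    """
    sync_ok = 0
    processor_failed = 0
    retries = Counter()
    for line in lines:
        if SYNC_OK in line:
            sync_ok += 1
        if PROC_FAILED in line:
            processor_failed += 1
        match = CPG_RETRY.search(line)
        if match:
            retries[match.group(1)] += 1
    return {
        "sync_ok": sync_ok,
        "processor_failed": processor_failed,
        "cpg_join_retries": dict(retries),
    }


# Example use (the log path is an assumption; adjust for your system):
# with open("/var/log/messages", errors="replace") as log:
#     print(summarize(log))
```

A healthy second node should give a non-zero sync_ok count and a processor_failed count of zero. The log in the description would give the opposite, plus one cpg_join retry each for fenced, gfs_controld and dlm_controld.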

Comment 4 Ernesto Rodriguez Reina 2010-02-19 02:48:08 UTC
(In reply to comment #3)
> This problem has not been reported.  It is possible your switch configuration
> is not correct,

I tried again with RHCS version 3.0.4 (which I know works in my environment) and found the same problem. I asked the network administrator whether he had made any changes in the past few days, and he answered that he had changed the switch interconnect, which now uses MSTP and redundant routes to the root switch. When he rolled the change back, everything worked in both versions.

We are going to investigate this a bit more. My apologies for the inconvenience, and thanks a lot for your help.

Comment 5 Steven Dake 2010-02-19 06:14:31 UTC
Any details you can provide about the brand/model of switch you are using and what exactly you changed to make it work would be useful info for others that run into switch issues.

Regards
-steve

