566573 – Continuous reporting of 'daemon cpg_join error retrying'

Summary: Continuous reporting of 'daemon cpg_join error retrying'

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	corosync
Sub Component:
Version:	12
Hardware:	All
OS:	Linux
Priority:	low
Severity:	urgent
Target Milestone:	---
Assignee:	Steven Dake
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-02-18 20:56 UTC by Ernesto Rodriguez Reina
Modified:	2016-04-26 15:00 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-02-19 02:48:08 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Logs and cluster.conf (82.55 KB, application/x-gzip) 2010-02-18 20:56 UTC, Ernesto Rodriguez Reina

Description Ernesto Rodriguez Reina 2010-02-18 20:56:18 UTC

Created attachment 394985
Logs and cluster.conf

Description of problem:

Continuous reporting of 'daemon cpg_join error retrying' on the second node to enter the cluster.

Version-Release number of selected component (if applicable):

Linux Kernel 2.6.38.2
dlm_controld 3.0.6
corosync 1.2.0
fenced 3.0.6
gfs_controld 3.0.6
rgmanager 3.0.6
cman 3.0.6

Actual results:
The syslog of the second node to enter the cluster repeatedly shows:

Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] CLM CONFIGURATION CHANGE
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] New Configuration:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] #011r(0) ip(10.10.10.1)
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] #011r(0) ip(10.10.10.2)
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] Members Left:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] Members Joined:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] CLM CONFIGURATION CHANGE
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] New Configuration:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] #011r(0) ip(10.10.10.1)
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] #011r(0) ip(10.10.10.2)
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] Members Left:
Feb 18 13:27:15 spare corosync[2109]:   [CLM   ] Members Joined:
Feb 18 13:27:15 spare corosync[2109]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 18 13:27:23 spare fenced[2162]: daemon cpg_join error retrying
Feb 18 13:27:23 spare gfs_controld[2196]: daemon cpg_join error retrying
Feb 18 13:27:23 spare dlm_controld[2181]: daemon cpg_join error retrying
Feb 18 13:27:25 spare corosync[2109]:   [TOTEM ] A processor failed, forming new configuration.

Comment 1 Steven Dake 2010-02-18 23:15:23 UTC

Which version of openais?

Comment 2 Ernesto Rodriguez Reina 2010-02-18 23:26:51 UTC

(In reply to comment #1)
> which version of openais?    

openais version 1.1.1

Comment 3 Steven Dake 2010-02-18 23:58:58 UTC

This problem has not been reported.  It is possible that your switch configuration is not correct, or that there is a bug in one of the software components.  To help us isolate the root cause, please run corosync directly using the corosync init script.  This requires a configured corosync.conf file, which is documented in the corosync.conf man page (man corosync.conf).

If your network is working properly, you will see:
Feb 18 16:48:15 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 18 16:48:15 corosync [MAIN  ] Completed service synchronization, ready to provide service.

This last line is very important: it indicates that synchronization completed properly.

You should see these lines in all cases after starting the 2nd node.  On a working network you will never see:
Feb 18 13:27:25 spare corosync[2109]:   [TOTEM ] A processor failed, forming new configuration.
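
As a rough illustration of the check described above, the following sketch scans a syslog excerpt for the messages quoted in this report. It is not part of corosync or any cluster tool; the function names and the category labels (`sync_completed`, `processor_failed`, and so on) are hypothetical, and only the message strings come from the logs in this bug.

```python
import re

# Message fragments taken from the log lines quoted in this report.
# The dictionary keys are illustrative labels, not corosync terminology.
PATTERNS = {
    "membership_formed": re.compile(r"A processor joined or left the membership"),
    "sync_completed": re.compile(
        r"Completed service synchronization, ready to provide service"
    ),
    "processor_failed": re.compile(r"A processor failed"),
    "cpg_join_retry": re.compile(r"daemon cpg_join error retrying"),
}


def summarize_log(text):
    """Count how often each known message appears in a syslog excerpt."""
    # Pasted logs are often wrapped mid-message, so collapse all runs of
    # whitespace (including newlines) into single spaces before matching.
    flat = " ".join(text.split())
    return {name: len(pattern.findall(flat)) for name, pattern in PATTERNS.items()}


def looks_healthy(counts):
    """Synchronization completed and no failure or retry messages were seen."""
    return (
        counts["sync_completed"] > 0
        and counts["processor_failed"] == 0
        and counts["cpg_join_retry"] == 0
    )


if __name__ == "__main__":
    import sys

    # Usage (assumed log location): python check_log.py < /var/log/messages
    result = summarize_log(sys.stdin.read())
    for name, count in result.items():
        print(f"{name}: {count}")
    print("healthy" if looks_healthy(result) else "problem indicators found")
```

A result with `sync_completed` at zero, or with any `processor_failed` or `cpg_join_retry` entries, corresponds to the symptoms in the original description and points toward the network checks suggested above.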

Comment 4 Ernesto Rodriguez Reina 2010-02-19 02:48:08 UTC

(In reply to comment #3)
> This problem has not been reported.  It is possible your switch configuration
> is not correct,

I tried again with RHCS version 3.0.4 (which I know works in my environment) and found the same problem. I asked the network administrator whether he had made any changes in the past few days, and he answered that he had changed the switch interconnect, which now uses MSTP and redundant routes to the root switch. When he rolled the change back, everything worked in both versions.

We are going to investigate this a bit more. My apologies for the inconvenience, and thanks a lot for your help.

Comment 5 Steven Dake 2010-02-19 06:14:31 UTC

Any details you can provide about the brand and model of the switch you are using, and exactly what you changed to make it work, would be useful information for others who run into switch issues.

Regards
-steve
