Bug 604658 - Some problems when using at least a Eth bridge interface as heartbeat network.
Status: CLOSED UPSTREAM
Alias: None
Product: Corosync Cluster Engine
Classification: Retired
Component: totem
Version: 1.3
Hardware: x86_64 Linux
Priority: low
Severity: medium
Assignee: Jan Friesse
Reported: 2010-06-16 12:51 UTC by Moullé Alain
Modified: 2011-10-04 13:51 UTC

Doc Type: Bug Fix
Last Closed: 2011-10-04 13:51:06 UTC


Description Moullé Alain 2010-06-16 12:51:23 UTC
Description of problem:
If corosync.conf contains two ringnumber records and one of them is bound to an Ethernet bridge, corosync does not start correctly on the first start: Pacemaker keeps both nodes UNCLEAN. If we then kill the corosync process, remove the lock file, and start it a second time, it works. In addition, if we stop corosync again and swap the bindnetaddr values between the two ringnumber records in corosync.conf on both nodes, the first start fails again in the same way; after killing the process and removing the lock, the second start works again.

Version-Release number of selected component (if applicable):
corosync-1.1.2-1.fc12.x86_64
but also on more recent releases.

How reproducible:
Everything is described under "Description of problem".

Steps to Reproduce:
Here is a copy of my email to Steven Dake on the Openais ML:
Hi Steve,

I've done more systematic tests on this issue, and here are my conclusions. With my network configuration, where one heartbeat is on eth1 and the other on br0 (a bridge), I tested starting corosync several times with:

first: 1st ringnumber linked to the eth1 IF and 2nd ringnumber linked to the br0 IF
then: 1st ringnumber linked to the br0 IF and 2nd ringnumber linked to the eth1 IF

It appears that just after changing the order of the networks in the rings, the first start always fails, meaning that crm_mon displays both nodes as UNCLEAN forever. Moreover, we can't stop corosync with /etc/init.d/corosync stop: it stays stalled forever, and we have to kill the process and rm -f the subsys lock file.

After that, the second start always works! crm_mon displays both nodes "On" after 60s. From then on, we can stop and start corosync many times without any problem.

As soon as I change the order of the networks in the rings again, the first start fails again, and I have to do the same thing as described just above so that any further stop/start works fine again.

So my conclusion is that if one of the two heartbeat rings is on a bridge IF, there is a systematic problem at the first start only!

I hope these tests help.


  
Actual results:


Expected results:


Additional info:

Comment 1 Bug Zapper 2010-07-30 12:08:48 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle.
Changing version to '14'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 2 Steven Dake 2011-07-21 20:40:41 UTC
Honza,

Now that redundant ring is in a supportable state, can you look into determining if this is still an issue?  There are lingering reports about bridging not working with corosync and bridging should be investigated.

Regards
-steve

Comment 3 Jan Friesse 2011-09-01 15:30:34 UTC
Alain,
I've installed fc12 with corosync. My configuration looks like this:
br0 with eth0
and eth1
on two nodes.

Sadly, even when changing the order of the redundant rings, I was not able to reproduce the issue.

Which rrp mode were you using? What concrete actions did you take to reproduce the issue? Could the firewall have been misconfigured? Or the switch? Were you using different mcast addresses for the rings?


(In reply to comment #0)
> Description of problem:

Comment 4 Moullé Alain 2011-09-01 16:01:23 UTC
Hi Jan,
long time ago ... 
but I used to set rrp_mode : active.
No firewall at all.
Same mcast addresses for rings and same port (default:5405)
About concrete actions, nothing more than those described in my first note at creation of this defect.
Regards
Alain

Comment 5 Jan Friesse 2011-10-04 13:51:06 UTC
(In reply to comment #4)
> Hi Jan,
> long time ago ... 
> but I used to set rrp_mode : active.
> No firewall at all.
> Same mcast addresses for rings and same port (default:5405)
> About concrete actions, nothing more than those described in my first note at
> creation of this defect.
> Regards
> Alain

Alain,
the main problem seems to be that the mcast addresses were the same.

Also, I have been working very hard on getting RRP into proper shape, so I believe this bug is fixed in RHEL-6 and in the currently supported Fedora package.

Closing as UPSTREAM; please feel free to reopen if you hit this bug again.
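
To illustrate the point about identical mcast addresses, here is a minimal Python sketch that scans a corosync.conf text for rings sharing the same mcastaddr and mcastport. It is only a sketch under assumptions: the sample configuration, its IP addresses, and the helper names parse_interfaces and find_mcast_conflicts are all hypothetical and are not taken from the reporter's setup or from corosync itself. The only values taken from this report are rrp_mode active and the default port 5405.

```python
import re

# Hypothetical corosync.conf fragment resembling the setup described in this
# bug: two rings, rrp_mode active, same mcast address and port on both rings.
# All addresses below are made up for illustration.
SAMPLE_CONF = """
totem {
    version: 2
    rrp_mode: active
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.2.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
"""

DEFAULT_MCASTPORT = "5405"  # default port mentioned in comment 4


def parse_interfaces(conf_text):
    """Return one dict of key/value strings per 'interface { ... }' block."""
    rings = []
    # Interface blocks hold no nested braces, so a non-greedy match suffices.
    for body in re.findall(r"interface\s*\{(.*?)\}", conf_text, flags=re.DOTALL):
        ring = {}
        for line in body.splitlines():
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if ":" in line:
                key, value = line.split(":", 1)
                ring[key.strip()] = value.strip()
        rings.append(ring)
    return rings


def find_mcast_conflicts(conf_text):
    """Return (ringnumber, ringnumber) pairs sharing mcastaddr and mcastport."""
    rings = parse_interfaces(conf_text)
    conflicts = []
    for i, first in enumerate(rings):
        if "mcastaddr" not in first:
            continue  # nothing to compare for this ring
        first_key = (first["mcastaddr"], first.get("mcastport", DEFAULT_MCASTPORT))
        for second in rings[i + 1:]:
            second_key = (second.get("mcastaddr"),
                          second.get("mcastport", DEFAULT_MCASTPORT))
            if first_key == second_key:
                conflicts.append((first.get("ringnumber"), second.get("ringnumber")))
    return conflicts


if __name__ == "__main__":
    for first, second in find_mcast_conflicts(SAMPLE_CONF):
        print(f"rings {first} and {second} share the same mcastaddr and mcastport")
```

With the sample above, the check flags rings 0 and 1. Giving the second ring its own mcastaddr (for example a hypothetical 226.94.1.2) makes the check pass; whether that alone would have avoided the first-start failure on the reporter's system was not verified in this report.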

