Description of problem:
If corosync.conf contains two ringnumber records and one of them is bound to an Ethernet bridge, corosync does not start correctly on the first start: Pacemaker shows both nodes as UNCLEAN and stays that way. If we then kill the corosync process, remove the lock file, and start it a second time, it works. To complete the picture: if we stop corosync again and swap the bindnetaddr values between the two ringnumber records in corosync.conf on both nodes, the first start fails again in the same way; after killing the process and removing the lock, the second start works again.
Version-Release number of selected component (if applicable):
but also on more recent releases.
All is given in "Description of problem"
Steps to Reproduce:
Here is a copy of my email to Steven Dake on the Openais ML:
I've done more systematic tests on this issue, and here are my conclusions. With my network configuration, where one heartbeat is on eth1 and the other on
br0 (a bridge), I tested starting corosync several times with:
first: 1st ringnumber bound to the eth1 interface and 2nd ringnumber bound to the br0 interface
then: 1st ringnumber bound to the br0 interface and 2nd ringnumber bound to the eth1 interface
It appears that just after changing the order of the networks in the rings,
the first start always fails, meaning that crm_mon displays both nodes
as UNCLEAN ad vitam aeternam. Moreover, we can't stop corosync with
/etc/init.d/corosync stop: it stalls ad vitam aeternam, and we have
to kill the process and rm -f the subsys lock file.
After that, the second start always works! crm_mon displays both nodes "On" after 60s. From then on, we can stop and start corosync many times without any problem.
As soon as I change the order of the networks in the rings again, the first start fails
again, and I have to do the same thing as described just above, so that any
further stop/start works fine again.
So my conclusion is that, if one of the two heartbeat rings is on a bridge
interface, there is a systematic problem on the first start only!
Hope these tests could help.
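For reference, the workaround described above (kill the process, remove the subsys lock file, start again) can be sketched as a small dry-run helper. This is only an illustration: the lock file path `/var/lock/subsys/corosync`, the use of `killall -9`, and the function names are assumptions, since the report only says "kill the process" and "the subsys lock file".

```python
import shlex

# Assumption: Red Hat style init scripts keep the lock here; the report
# only mentions "the subsys lock file" without giving a path.
LOCK_FILE = "/var/lock/subsys/corosync"


def recovery_commands(lock_file=LOCK_FILE):
    """Return the workaround steps as argument lists (nothing is executed)."""
    return [
        ["killall", "-9", "corosync"],      # assumption: the report does not name a signal
        ["rm", "-f", lock_file],            # remove the stale subsys lock
        ["/etc/init.d/corosync", "start"],  # second start, which succeeded in the report
    ]


def as_shell(commands):
    """Render the steps as shell lines so they can be reviewed before running."""
    return "\n".join(" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands)


if __name__ == "__main__":
    print(as_shell(recovery_commands()))
```

The helper only prints the commands; running them by hand (as root, on each node) keeps the destructive steps under the operator's control.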
This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle.
Changing version to '14'.
More information and reason for this action is here:
Now that redundant ring is in a supportable state, can you look into determining if this is still an issue? There are lingering reports about bridging not working with corosync and bridging should be investigated.
I've installed fc12 with corosync. My configuration looks like:
br0 with eth0
on two nodes.
Sadly, even when I changed the order of the redundant rings, I was not able to reproduce the issue.
Which rrp mode were you using? What concrete actions did you take to reproduce the issue? Could the firewall have been misconfigured? Or the switch? Were you using different mcast addresses for the rings?
(In reply to comment #0)
> Description of problem:
long time ago ...
but I used to set rrp_mode : active.
No firewall at all.
Same mcast addresses for rings and same port (default:5405)
About concrete actions, nothing more than those described in my first note at creation of this defect.
(In reply to comment #4)
> Hi Jan,
> long time ago ...
> but I used to set rrp_mode : active.
> No firewall at all.
> Same mcast addresses for rings and same port (default:5405)
> About concrete actions, nothing more than those described in my first note at
> creation of this defect.
The main problem seems to be that the mcast addresses were the same.
I have also been working very hard on getting RRP into proper shape, so I believe this bug is fixed in RHEL-6 and in the currently supported Fedora package.
Closing as UPSTREAM; please feel free to reopen if you hit this bug again.
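Following the diagnosis above, a configuration can be checked for rings that share one multicast address and port. The sketch below is a minimal illustration, not a full corosync.conf parser: the sample excerpt, its bindnetaddr and mcastaddr values, and the function names are hypothetical; only rrp_mode: active, the identical mcast address on both rings, and the default port 5405 come from this report.

```python
import re
from collections import defaultdict

# Hypothetical excerpt; the addresses are placeholders, not the reporter's values.
SAMPLE_CONF = """
totem {
    version: 2
    rrp_mode: active
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
"""


def parse_interfaces(conf_text):
    """Return one dict of key/value settings per `interface { ... }` block."""
    interfaces = []
    for body in re.findall(r"interface\s*\{([^{}]*)\}", conf_text):
        settings = {}
        for line in body.splitlines():
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if ":" in line:
                key, value = line.split(":", 1)
                settings[key.strip()] = value.strip()
        interfaces.append(settings)
    return interfaces


def rings_sharing_mcast(interfaces):
    """Map each (mcastaddr, mcastport) used by more than one ring to those rings."""
    groups = defaultdict(list)
    for iface in interfaces:
        # 5405 is the default port mentioned in this report.
        endpoint = (iface.get("mcastaddr"), iface.get("mcastport", "5405"))
        groups[endpoint].append(iface.get("ringnumber"))
    return {ep: rings for ep, rings in groups.items() if len(rings) > 1}


if __name__ == "__main__":
    for (addr, port), rings in rings_sharing_mcast(parse_interfaces(SAMPLE_CONF)).items():
        print(f"rings {', '.join(rings)} share {addr}:{port}")
```

With the sample above, both rings are flagged; giving each interface block its own mcastaddr makes the check come back empty.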