Bug 636583

Summary: corosync crashes ([TOTEM ] FAILED TO RECEIVE)
Product: [Retired] Corosync Cluster Engine
Reporter: Lorenzo Sartoratti <lorenzo.sartoratti>
Component: totem
Assignee: Jan Friesse <jfriesse>
Status: CLOSED UPSTREAM
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: 1.3
CC: agk, ari.tilli, asalkeld, fdinitto, jfriesse, mkelly, sdake, uwe.knop
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-11-05 14:31:35 UTC
Attachments:
Description                      Flags
Coredump of corosync             none
Fplay of the coredump            none
patch which may fix the abort    none
Proposed patch                   none

Description Lorenzo Sartoratti 2010-09-22 15:58:18 UTC
Description of problem:
The corosync process crashes randomly on one of the cluster members.

Version-Release number of selected component (if applicable):
1.2.7 and 1.2.8

How reproducible:
Don't know.
The last time was when I launched corosync-fplay on another member.

Additional info:
I've attached the output from 'corosync-fplay' on the node that crashed

Comment 1 Steven Dake 2010-09-22 16:44:26 UTC
Could you attach the core file (/var/lib/corosync/core) and tell us which corosync version you're using (rpm -qi corosync)?

Thanks
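
The two checks requested above could be scripted roughly as follows. This is only a sketch: the helper names `find_core_files` and `installed_version` are hypothetical, the `core*` glob is an assumption about how core files are named in that directory, and `rpm -q` is used instead of `rpm -qi` to get just the package name and version.

```python
import glob
import os
import subprocess


def find_core_files(directory="/var/lib/corosync"):
    """Return core files found in `directory`, sorted by name.

    Assumption: core files are named `core` or `core.<pid>`.
    """
    return sorted(glob.glob(os.path.join(directory, "core*")))


def installed_version(package="corosync"):
    """Return the `rpm -q` output for `package`.

    Returns None if rpm is not available or the package is not installed.
    """
    try:
        result = subprocess.run(
            ["rpm", "-q", package],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None
    return result.stdout.strip() if result.returncode == 0 else None
```

Calling `find_core_files()` and `installed_version()` on the affected node gives the two pieces of information asked for. An empty list does not rule out a crash, since the core may have been captured elsewhere (for example by abrt).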

Comment 2 Lorenzo Sartoratti 2010-09-22 19:34:09 UTC
There's no core file in /var/lib/corosync
I'm using version 1.2.8

Comment 3 Lorenzo Sartoratti 2010-09-23 08:24:04 UTC
Created attachment 449147 [details]
Coredump of corosync

I've attached the complete dir generated by abrt of a corosync coredump

Comment 4 Lorenzo Sartoratti 2010-09-23 08:24:55 UTC
Created attachment 449148 [details]
Fplay of the coredump

Comment 5 Lorenzo Sartoratti 2010-09-23 08:34:30 UTC
I've opened a new case with abrt: 636774

Comment 6 Lorenzo Sartoratti 2010-10-01 08:14:00 UTC
Hi,
you were right!
I've solved the problem by reconfiguring multicast on the switches
to which the four hosts are attached. Two hosts are connected to one switch and
the other two to the other. The switches are from two different manufacturers and are configured differently. The main fix was to force the link between
the switches to forward traffic for the specific multicast address.
Thank you for your support!

Lorenzo Sartoratti
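
A quick way to check whether multicast traffic crosses the link between two switches is a small listener/sender pair. The following is a minimal sketch, not a corosync tool: the function names are hypothetical, and the group address and port must be supplied by the caller, since the address used in this cluster is not given in the report.

```python
import ipaddress
import socket
import struct


def is_multicast(address):
    """Return True if `address` is an IPv4 multicast address (224.0.0.0/4)."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and ip.is_multicast


def membership_request(group, interface="0.0.0.0"):
    """Build the 8-byte ip_mreq structure used with IP_ADD_MEMBERSHIP."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def listen(group, port, timeout=10.0):
    """Join `group` and wait for one datagram.

    Returns (data, sender_address), or None if nothing arrives in time.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group))
        sock.settimeout(timeout)
        try:
            return sock.recvfrom(1024)
        except socket.timeout:
            return None
    finally:
        sock.close()


def send(group, port, payload=b"mcast-probe", ttl=1):
    """Send one datagram to `group`; TTL 1 keeps it on the local subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                        struct.pack("b", ttl))
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

Run `listen(group, port)` on a host attached to one switch and `send(group, port)` on a host attached to the other. If the listener times out while the same test works between two hosts on the same switch, the inter-switch link is probably not forwarding that group. A single probe is a much lighter load than real Totem traffic, so a successful run is only a weak indication that the network is healthy.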

Comment 7 Steven Dake 2011-02-07 20:19:27 UTC
Lorenzo,

I'd be deeply indebted to you if you would try the attached patch in your environment (with the defective setup) and see if you continue to see aborts.

Regards
-steve

Comment 8 Steven Dake 2011-02-07 20:20:12 UTC
Created attachment 477496 [details]
patch which may fix the abort

Comment 10 Bug Zapper 2011-05-31 12:51:31 UTC
This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 11 Jan Friesse 2011-06-28 13:41:10 UTC
*** Bug 629431 has been marked as a duplicate of this bug. ***

Comment 12 Jan Friesse 2011-06-28 13:41:20 UTC
*** Bug 636774 has been marked as a duplicate of this bug. ***

Comment 14 Jan Friesse 2012-06-18 09:14:51 UTC
I believe this bug was fixed in Corosync 1.4.x (and 1.3.x zstream). Closing as upstream.

Comment 15 Steven Dake 2012-09-04 15:31:31 UTC
Honza,

Could you verify the patch in this bug is in upstream?  If not, can you try this patch on Bug #854216?

Comment 16 Jan Friesse 2012-10-24 12:15:17 UTC
Created attachment 632734 [details]
Proposed patch

Comment 17 Jan Friesse 2012-11-05 14:31:35 UTC
Proposed patch is now upstream as d4db2ea5353c8eedb64a88ae413c04e0757378c9 (or flatiron 81ff0e8c94589bb7139d89e573a75473cfc5d173)
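
Whether a given checkout already contains one of these commits can be checked with `git merge-base --is-ancestor`. The wrapper below is a sketch under assumptions: the function names are hypothetical, `repo_dir` must point at a local clone of the corosync repository, and `ref` defaults to `HEAD`.

```python
import re
import subprocess

# Commit hashes quoted in this comment.
UPSTREAM_COMMIT = "d4db2ea5353c8eedb64a88ae413c04e0757378c9"
FLATIRON_COMMIT = "81ff0e8c94589bb7139d89e573a75473cfc5d173"


def is_full_sha1(value):
    """Return True if `value` looks like a full 40-character SHA-1 hash."""
    return re.fullmatch(r"[0-9a-f]{40}", value) is not None


def contains_commit(repo_dir, commit, ref="HEAD"):
    """Return True if `commit` is an ancestor of `ref` in the repo at `repo_dir`.

    Any git error (for example an unknown commit) is also reported as False.
    """
    if not is_full_sha1(commit):
        raise ValueError("expected a full 40-character commit hash")
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit, ref],
        capture_output=True,
    )
    return result.returncode == 0
```

For example, `contains_commit("/path/to/corosync", FLATIRON_COMMIT)` run on a checkout of the flatiron branch reports whether that build already includes the fix; the path is a placeholder.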

Comment 18 Jan Friesse 2012-11-19 07:49:23 UTC
*** Bug 875922 has been marked as a duplicate of this bug. ***