Bug 636583 - corosync crashes ([TOTEM ] FAILED TO RECEIVE)
Summary: corosync crashes ([TOTEM ] FAILED TO RECEIVE)
Status: CLOSED UPSTREAM
Alias: None
Product: Corosync Cluster Engine
Classification: Retired
Component: totem   
Version: 1.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jan Friesse
QA Contact:
URL:
Whiteboard:
Keywords: Reopened
Duplicates: 629431 636774 875922
Depends On:
Blocks:
Reported: 2010-09-22 15:58 UTC by Lorenzo Sartoratti
Modified: 2012-11-19 07:49 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-11-05 14:31:35 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Coredump of corosync (1.25 MB, application/zip)
2010-09-23 08:24 UTC, Lorenzo Sartoratti
no flags
Fplay of the coredump (198.26 KB, application/zip)
2010-09-23 08:24 UTC, Lorenzo Sartoratti
no flags
patch which may fix the abort (1.05 KB, patch)
2011-02-07 20:20 UTC, Steven Dake
no flags
Proposed patch (1.72 KB, patch)
2012-10-24 12:15 UTC, Jan Friesse
no flags

Description Lorenzo Sartoratti 2010-09-22 15:58:18 UTC
Description of problem:
The corosync process crashes randomly on one of the cluster members.

Version-Release number of selected component (if applicable):
1.2.7 and 1.2.8

How reproducible:
Don't know.
The last time, it happened when I launched corosync-fplay on another member.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
I've attached the output of 'corosync-fplay' from the node that crashed.

Comment 1 Steven Dake 2010-09-22 16:44:26 UTC
Could you attach the core file (/var/lib/corosync/core) and tell us which corosync version you're using (rpm -qi corosync)?

Thanks

Comment 2 Lorenzo Sartoratti 2010-09-22 19:34:09 UTC
There's no core file in /var/lib/corosync
I'm using version 1.2.8

Comment 3 Lorenzo Sartoratti 2010-09-23 08:24:04 UTC
Created attachment 449147 [details]
Coredump of corosync

I've attached the complete directory generated by abrt for a corosync coredump.

Comment 4 Lorenzo Sartoratti 2010-09-23 08:24:55 UTC
Created attachment 449148 [details]
Fplay of the coredump

Comment 5 Lorenzo Sartoratti 2010-09-23 08:34:30 UTC
I've opened a new bug with abrt: 636774

Comment 6 Lorenzo Sartoratti 2010-10-01 08:14:00 UTC
Hi,
you were right!
I've solved the problem by reconfiguring the multicast settings of the switches
the four hosts are attached to. Two hosts are connected to one switch and the
other two to the other. The switches are from two different manufacturers and
are configured differently. The main fix was to force the link between the
switches to forward traffic for the specific multicast address.
Thank you for your support!

Lorenzo Sartoratti
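Comment 6 traces the crashes to multicast traffic not being forwarded between the two switches. One way to check whether a given multicast group actually reaches every node, before and after changing switch settings, is a small sender/receiver pair, similar in spirit to tools such as omping. The following is a minimal Python sketch, not part of corosync: the group address, port, file name, and the `send`/`receive`/`membership_request` helpers are hypothetical placeholders. Set `GROUP` to the `mcastaddr` from your corosync.conf and `PORT` to an unused UDP port so the test does not disturb a running corosync. The report does not say which switch feature was at fault; IGMP snooping on the inter-switch link is a common suspect, but that is an assumption here.

```python
"""Minimal multicast reachability check (hypothetical sketch, similar to omping).

Run `receive` on a node behind one switch and `send` on a node behind the
other. GROUP and PORT are placeholders: set GROUP to the mcastaddr from
corosync.conf and PORT to an unused port, not corosync's own mcastport.
"""
import socket
import struct
import sys

GROUP = "239.192.0.1"  # hypothetical: replace with your corosync mcastaddr
PORT = 5410            # hypothetical: an unused UDP port
TTL = 1                # TTL is only decremented by routers, so 1 still crosses switches


def membership_request(group: str, iface: str = "0.0.0.0") -> bytes:
    """Build the struct ip_mreq (group address, local interface) for IP_ADD_MEMBERSHIP."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def send(count: int = 5) -> None:
    """Send `count` numbered probe datagrams to GROUP:PORT."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, TTL)
    for seq in range(count):
        sock.sendto(f"probe {seq}".encode(), (GROUP, PORT))
    sock.close()


def receive(timeout: float = 10.0) -> int:
    """Join GROUP and print datagrams until `timeout` seconds pass with none; return the count."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # Joining makes the kernel send an IGMP membership report; switches doing
    # IGMP snooping typically forward the group only to ports that joined.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(GROUP))
    sock.settimeout(timeout)
    received = 0
    try:
        while True:
            data, (addr, _port) = sock.recvfrom(1500)
            print(f"{addr}: {data.decode(errors='replace')}")
            received += 1
    except socket.timeout:
        pass  # no datagram within `timeout` seconds: stop listening
    finally:
        sock.close()
    return received


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    if mode == "send":
        send()
    elif mode == "receive":
        print(f"received {receive()} datagram(s)")
    else:
        print("usage: mcast_check.py send|receive")
```

Run `python3 mcast_check.py receive` on a node attached to one switch and `python3 mcast_check.py send` on a node attached to the other. If the receiver reports zero datagrams while the same test succeeds between two nodes on the same switch, traffic for that group is likely not crossing the inter-switch link.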

Comment 7 Steven Dake 2011-02-07 20:19:27 UTC
Lorenzo,

I'd be deeply indebted to you if you would try the attached patch in your environment (with the defective setup) and see if you continue to see aborts.

Regards
-steve

Comment 8 Steven Dake 2011-02-07 20:20:12 UTC
Created attachment 477496 [details]
patch which may fix the abort

Comment 10 Bug Zapper 2011-05-31 12:51:31 UTC
This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 11 Jan Friesse 2011-06-28 13:41:10 UTC
*** Bug 629431 has been marked as a duplicate of this bug. ***

Comment 12 Jan Friesse 2011-06-28 13:41:20 UTC
*** Bug 636774 has been marked as a duplicate of this bug. ***

Comment 14 Jan Friesse 2012-06-18 09:14:51 UTC
I believe this bug was fixed in Corosync 1.4.x (and 1.3.x zstream). Closing as upstream.

Comment 15 Steven Dake 2012-09-04 15:31:31 UTC
Honza,

Could you verify whether the patch in this bug is upstream? If not, could you try this patch on Bug #854216?

Comment 16 Jan Friesse 2012-10-24 12:15:17 UTC
Created attachment 632734 [details]
Proposed patch

Comment 17 Jan Friesse 2012-11-05 14:31:35 UTC
Proposed patch is now upstream as d4db2ea5353c8eedb64a88ae413c04e0757378c9 (or flatiron 81ff0e8c94589bb7139d89e573a75473cfc5d173)

Comment 18 Jan Friesse 2012-11-19 07:49:23 UTC
*** Bug 875922 has been marked as a duplicate of this bug. ***

