989934 – corosync 1.4.6 crashes when an unplugged network cable is plugged back in (udpu mode)

Summary: corosync 1.4.6 crashes when an unplugged network cable is plugged back in (udpu mode)

Keywords:
Status:	CLOSED DUPLICATE of bug 881694
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	corosync
Sub Component:
Version:	6.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-07-30 07:47 UTC by Shining
Modified:	2013-08-05 08:10 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-08-05 08:10:45 UTC
Target Upstream Version:
Embargoed:

Description Shining 2013-07-30 07:47:38 UTC

Description of problem:


Version-Release number of selected component (if applicable):
1.4.6


How reproducible:

Steps to Reproduce:
1. Configure corosync in udpu mode
2. service corosync start
3. ifdown eth0 (or unplug the network cable)
4. ifup eth0 (or plug the network cable back in)


Actual results:
corosync crashes.

Expected results:
corosync comes back online.

Additional info:

--corosync.conf--------------------------------
# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                member {
                        memberaddr: 172.16.75.1
                }
                member {
                        memberaddr: 172.16.75.128
                }
                member {
                        memberaddr: 172.16.75.131
                }
                ringnumber: 0
                bindnetaddr: 172.16.75.128
                mcastport: 5495
                ttl: 1
        }
        transport: udpu
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: no
        logfile: /var/log/corosync.log
        debug: on
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: on
        }
}

amf {
        mode: disabled
}
-----------------------------------------------

--gdb stack---------------------------------------
#0  0x00000033bdc32885 in raise () from /lib64/libc.so.6
#1  0x00000033bdc34065 in abort () from /lib64/libc.so.6
#2  0x00000033bdc2b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000033bdc2bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f5e8102aa6c in memb_consensus_agreed (instance=0x7f5e7f39d010) at totemsrp.c:1244
#5  0x00007f5e8102ea1f in memb_join_process (instance=0x7f5e7f39d010, memb_join=0x172c220) at totemsrp.c:4066
#6  0x00007f5e8102edc9 in message_handler_memb_join (instance=0x7f5e7f39d010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:4311
#7  0x00007f5e810287e8 in rrp_deliver_fn (context=<value optimized out>, msg=0x172c220, msg_len=244) at totemrrp.c:1747
#8  0x00007f5e81025b3a in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x172bb90) at totemudpu.c:1152
#9  0x00007f5e8101e482 in poll_run (handle=2697991128409440256) at coropoll.c:513
#10 0x00000000004072be in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1927
-----------------------------------------------

Comment 2 Christine Caulfield 2013-07-30 12:10:19 UTC

Corosync 1.4.6 is not part of RHEL-6; RHEL-6 ships version 1.4.1. Does this happen on a supported version?

Comment 3 Shining 2013-07-31 01:42:16 UTC

I built corosync-1.4.6 from the latest corosync source code and the corosync src rpm package from RHEL 6.
I will run another test on corosync-1.4.1 to check whether the bug also exists in 1.4.1.

Comment 4 Shining 2013-07-31 10:22:42 UTC

I am so sorry. This bug is caused by a service I wrote myself. After removing my service from corosync, corosync works correctly again.

Comment 5 Christine Caulfield 2013-07-31 12:14:51 UTC

No problem. Can you close this BZ then please :)

Comment 6 Shining 2013-08-01 06:23:32 UTC

Saying that the bug was gone was a mistake. It is still there.
Because my service had enabled core dumps, I could detect the corosync crash by the existence of a core file.
After I removed my service, no core file was generated when corosync crashed.

-----------------------------------------------------------------
Aug 01 14:09:45 corosync [TOTEM ] The network interface [172.20.0.128] is now up.
Aug 01 14:09:45 corosync [TOTEM ] adding new UDPU member {172.20.0.128}
my_failed_list 1 my_proc_list 2 token_memb_entries 1
Aug 01 14:09:45 corosync [TOTEM ] entering GATHER state from 15.
my_failed_list 1 my_proc_list 2 token_memb_entries 1
my_failed_list 1 my_proc_list 2 token_memb_entries 1
...
...
my_failed_list 1 my_proc_list 2 token_memb_entries 1
my_failed_list 2 my_proc_list 2 token_memb_entries 0
corosync: totemsrp.c:1258: memb_consensus_agreed: Assertion `token_memb_entries >= 1' failed.
Aug 01 14:09:46 corosync [TOTEM ] entering GATHER state from 0.
./myrun: line 3:  2003 Aborted                 (core dumped) ./corosync -f "$@"
-----------------------------------------------------------------

my_failed_list 1:
172.20.0.128
my_proc_list 2:
172.20.0.128
127.0.0.1

At the point of the crash:
my_failed_list 2:
172.20.0.128
127.0.0.1
my_proc_list 2:
172.20.0.128
127.0.0.1

Does my_failed_list or my_proc_list need to be reinitialized after the network interface comes back up?

Comment 7 Shining 2013-08-01 06:29:30 UTC

---------------------
my_failed_list 1:
172.20.0.128
my_proc_list 2:
172.20.0.128
127.0.0.1
---------------------
should be
---------------------
my_failed_list 2:
172.20.0.128
127.0.0.1
my_proc_list 1:
172.20.0.128
---------------------
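
The counts printed in the log in comment 6 are consistent with token_memb_entries being the number of my_proc_list entries that are not in my_failed_list. The following is a minimal Python model of that check, assuming the assertion in memb_consensus_agreed is evaluated on such a set difference. The function names are hypothetical and this is not the corosync code; the addresses are the ones reported at the point of the crash.

```python
def token_members(proc_list, failed_list):
    """Return the entries of proc_list that are not in failed_list, in order."""
    failed = set(failed_list)
    return [addr for addr in proc_list if addr not in failed]


def assertion_would_hold(proc_list, failed_list):
    """Model of the `token_memb_entries >= 1` assertion (an assumption, see above)."""
    return len(token_members(proc_list, failed_list)) >= 1


# Lists reported at the point of the crash in comment 6.
crash_proc_list = ["172.20.0.128", "127.0.0.1"]
crash_failed_list = ["172.20.0.128", "127.0.0.1"]

if __name__ == "__main__":
    members = token_members(crash_proc_list, crash_failed_list)
    print("token_memb_entries", len(members))
    print("assertion holds:", assertion_would_hold(crash_proc_list, crash_failed_list))
```

Under this model, once every address in my_proc_list also appears in my_failed_list, nothing is left over and the assertion fails, which matches the final log line before the abort.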

Comment 8 Jan Friesse 2013-08-05 08:10:45 UTC

ifdown is unsupported. The only supported ways to simulate a failure are an iptables drop (of both unicast and multicast traffic) or unplugging the cable WITHOUT NetworkManager (NM does an ifdown on cable unplug).

Also, this is a clone of bug 881694.

*** This bug has been marked as a duplicate of bug 881694 ***
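
One possible way to script the iptables-based failure test mentioned in comment 8 is sketched below. It only builds the command lines and does not run them. Dropping everything on one interface is an assumption about what a suitable rule set looks like, not a procedure given in this bug; the helper name is hypothetical and eth0 is taken from the reproduction steps.

```python
def interface_drop_commands(interface, action="-A"):
    """Build iptables commands that drop all traffic on one interface.

    action is "-A" to append the rules (simulate the failure) or
    "-D" to delete the same rules again (heal the link).
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return [
        ["iptables", action, "INPUT", "-i", interface, "-j", "DROP"],
        ["iptables", action, "OUTPUT", "-o", interface, "-j", "DROP"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; running them needs root
    # and cuts every connection on the interface, including SSH sessions.
    for cmd in interface_drop_commands("eth0", "-A"):
        print(" ".join(cmd))
    for cmd in interface_drop_commands("eth0", "-D"):
        print(" ".join(cmd))
```

Because the rules match on the interface rather than on an address or port, they cover unicast and multicast traffic alike while leaving the interface itself up.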
