Bug 989934

Summary:	corosync 1.4.6 crashes when an unplugged network cable is plugged back in (udpu mode)
Product:	Red Hat Enterprise Linux 6	Reporter:	Shining <nshi_nb>
Component:	corosync	Assignee:	Jan Friesse <jfriesse>
Status:	CLOSED DUPLICATE	QA Contact:	Cluster QE <mspqa-list>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	6.2	CC:	ccaulfie, cluster-maint, sdake
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-08-05 08:10:45 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Shining 2013-07-30 07:47:38 UTC

Description of problem:


Version-Release number of selected component (if applicable):
1.4.6


How reproducible:

Steps to Reproduce:
1. configure corosync in udpu mode
2. service corosync start
3. ifdown eth0 (or unplug the network cable)
4. ifup eth0 (or plug the network cable back in)


Actual results:
corosync crashes.

Expected results:
corosync comes back online.

Additional info:

--corosync.conf--------------------------------
# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                member {
                        memberaddr: 172.16.75.1
                }
                member {
                        memberaddr: 172.16.75.128
                }
                member {
                        memberaddr: 172.16.75.131
                }
                ringnumber: 0
                bindnetaddr: 172.16.75.128
                mcastport: 5495
                ttl: 1
        }
        transport: udpu
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: no
        logfile: /var/log/corosync.log
        debug: on
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: on
        }
}

amf {
        mode: disabled
}
-----------------------------------------------

--gdb stack---------------------------------------
#0  0x00000033bdc32885 in raise () from /lib64/libc.so.6
#1  0x00000033bdc34065 in abort () from /lib64/libc.so.6
#2  0x00000033bdc2b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000033bdc2bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f5e8102aa6c in memb_consensus_agreed (instance=0x7f5e7f39d010) at totemsrp.c:1244
#5  0x00007f5e8102ea1f in memb_join_process (instance=0x7f5e7f39d010, memb_join=0x172c220) at totemsrp.c:4066
#6  0x00007f5e8102edc9 in message_handler_memb_join (instance=0x7f5e7f39d010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:4311
#7  0x00007f5e810287e8 in rrp_deliver_fn (context=<value optimized out>, msg=0x172c220, msg_len=244) at totemrrp.c:1747
#8  0x00007f5e81025b3a in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x172bb90) at totemudpu.c:1152
#9  0x00007f5e8101e482 in poll_run (handle=2697991128409440256) at coropoll.c:513
#10 0x00000000004072be in main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at main.c:1927
-----------------------------------------------

Comment 2 Christine Caulfield 2013-07-30 12:10:19 UTC

Corosync 1.4.6 is not part of RHEL-6; RHEL-6 ships version 1.4.1. Does this happen on a supported version?

Comment 3 Shining 2013-07-31 01:42:16 UTC

I built corosync-1.4.6 from the latest corosync source code and the corosync source RPM package from RHEL 6.
I will run another test on corosync-1.4.1 to check whether the bug also exists in 1.4.1.

Comment 4 Shining 2013-07-31 10:22:42 UTC

I am so sorry. This bug is caused by a service I wrote myself. After removing my service from corosync, corosync works correctly again.

Comment 5 Christine Caulfield 2013-07-31 12:14:51 UTC

No problem. Can you close this BZ then please :)

Comment 6 Shining 2013-08-01 06:23:32 UTC

I was mistaken when I said the bug had gone away. It is still there.
My service enabled core dumps, so I could detect the corosync crash by the presence of a core file.
After I removed my service, no core file was generated when corosync crashed, so the crash went unnoticed.

-----------------------------------------------------------------
Aug 01 14:09:45 corosync [TOTEM ] The network interface [172.20.0.128] is now up.
Aug 01 14:09:45 corosync [TOTEM ] adding new UDPU member {172.20.0.128}
my_failed_list 1 my_proc_list 2 token_memb_entries 1
Aug 01 14:09:45 corosync [TOTEM ] entering GATHER state from 15.
my_failed_list 1 my_proc_list 2 token_memb_entries 1
my_failed_list 1 my_proc_list 2 token_memb_entries 1
...
...
my_failed_list 1 my_proc_list 2 token_memb_entries 1
my_failed_list 2 my_proc_list 2 token_memb_entries 0
corosync: totemsrp.c:1258: memb_consensus_agreed: Assertion `token_memb_entries >= 1' failed.
Aug 01 14:09:46 corosync [TOTEM ] entering GATHER state from 0.
./myrun: line 3:  2003 Aborted                 (core dumped) ./corosync -f "$@"
-----------------------------------------------------------------

my_failed_list 1:
172.20.0.128
my_proc_list 2:
172.20.0.128
127.0.0.1

At the point of the crash:
my_failed_list 2:
172.20.0.128
127.0.0.1
my_proc_list 2:
172.20.0.128
127.0.0.1

Do my_failed_list or my_proc_list need to be reinitialized after the network interface comes back up?
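
For illustration, the counters logged above are consistent with token_memb being my_proc_list with every entry of my_failed_list removed. The Python sketch below assumes that relationship (it is inferred from the logged values, not quoted from totemsrp.c), and the function names are hypothetical. It shows why identical lists at the point of the crash leave zero entries and trip the assertion.

```python
def subtract_members(proc_list, failed_list):
    """Return the entries of proc_list that are not in failed_list, in order."""
    failed = set(failed_list)
    return [addr for addr in proc_list if addr not in failed]


def consensus_candidates(proc_list, failed_list):
    """Hypothetical stand-in for the check that aborted corosync."""
    token_memb = subtract_members(proc_list, failed_list)
    # Mirrors the logged assertion `token_memb_entries >= 1`.
    assert len(token_memb) >= 1, "token_memb_entries >= 1"
    return token_memb


# State reported at the point of the crash: both lists hold the same addresses.
proc = ["172.20.0.128", "127.0.0.1"]
failed = ["172.20.0.128", "127.0.0.1"]

remaining = subtract_members(proc, failed)
# Same field order as the debug output:
# my_failed_list, my_proc_list, token_memb_entries
print(len(failed), len(proc), len(remaining))  # prints "2 2 0"
```

Under this assumption, once every address in my_proc_list (including the local 127.0.0.1 entry) is also in my_failed_list, nothing is left to agree on and the assertion cannot hold.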

Comment 7 Shining 2013-08-01 06:29:30 UTC

---------------------
my_failed_list 1:
172.20.0.128
my_proc_list 2:
172.20.0.128
127.0.0.1
---------------------
should be
---------------------
my_failed_list 2:
172.20.0.128
127.0.0.1
my_proc_list 1:
172.20.0.128
---------------------

Comment 8 Jan Friesse 2013-08-05 08:10:45 UTC

ifdown is unsupported. The only supported ways to simulate a failure are an iptables drop (of both unicast and multicast traffic) or unplugging the cable WITHOUT NetworkManager running (NM does an ifdown on cable unplug).
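
As a rough sketch of the iptables approach, the Python helper below only builds and prints the rule commands; it does not run them. It assumes the mcastport of 5495 from the configuration in the description, and that corosync also uses mcastport - 1, as described in corosync.conf(5). Dropping UDP by destination port on both INPUT and OUTPUT is one possible way to cover unicast and multicast traffic; the helper name drop_rules is hypothetical.

```python
import shlex


def drop_rules(mcastport, action="-A"):
    """Build iptables commands that drop corosync UDP traffic.

    action is "-A" to append the rules or "-D" to delete them again.
    """
    ports = (mcastport - 1, mcastport)  # assumption: both ports are in use
    rules = []
    for chain in ("INPUT", "OUTPUT"):
        for port in ports:
            rules.append(
                ["iptables", action, chain,
                 "-p", "udp", "--dport", str(port), "-j", "DROP"]
            )
    return rules


# Print the commands for review; they would have to be run as root by hand.
for rule in drop_rules(5495):
    print(shlex.join(rule))
```

Calling drop_rules(5495, action="-D") yields the matching commands to remove the rules and end the simulated failure.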

Also, this is a clone of bug 881694.

*** This bug has been marked as a duplicate of bug 881694 ***