Bug 881694

Summary: corosync process is under heavy load and deadlocks in plug/unplug network cable test
Product: Red Hat Enterprise Linux 6
Component: corosync
Version: 6.3
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Target Milestone: rc
Reporter: Shining <nshi_nb>
Assignee: Jan Friesse <jfriesse>
QA Contact: cluster-qe <cluster-qe>
CC: fdinitto, jkortus, pzimek
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-08-11 14:31:23 UTC
Attachments: corosync log files

Description Shining 2012-11-29 11:17:59 UTC
Created attachment 654195: corosync log files

Description of problem:


Version-Release number of selected component (if applicable):
[root@gcluster74 ~]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.2 (Santiago)
Release:        6.2
Codename:       Santiago

[gbase@gcluster77 ~]$ rpm -qa | grep corosync
corosync-1.4.1-4.el6.x86_64
corosynclib-1.4.1-4.el6.x86_64

/etc/corosync/corosync.conf
--------------
totem {
        version: 2
        secauth: on
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.9.74
                mcastaddr: 226.94.1.9
                mcastport: 5498
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/corosync.log
        debug: on   ## on only for node 74; off on the other nodes
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}
--------------

How reproducible:
Four cluster nodes, with IP addresses 192.168.9.71 through 192.168.9.74.
Unplug the network cable on one or two nodes, wait a few seconds, then plug the network cable back in.

Steps to Reproduce:
1. Unplug the network cable.
2. Wait a few seconds.
3. Plug the network cable back in.
  
Actual results:
The cluster node is stuck in an infinite loop of Gather state 11.

Expected results:
The cluster node returns to a consistent state.


Additional info:
See the attached log files.
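
To get a rough idea of whether a node keeps cycling through the gather state as described under "Actual results", the number of matching log lines can be counted. This is only a sketch: the function name count_gather is made up for illustration, and it assumes that the relevant log lines contain the words "gather state" in any letter case. The exact wording in the attached logs may differ.

```shell
#!/bin/sh
# Sketch: count log lines that mention the gather state.
# Assumption: such lines contain the words "gather state" (any case).
count_gather() {
    # grep -c prints the number of matching lines; -i ignores case.
    grep -ci 'gather state' "$1"
}

# Example call against the logfile path set in corosync.conf above:
# count_gather /var/log/corosync.log
```

A count that keeps growing between two calls a few seconds apart would be consistent with the loop described above, though it is not proof of it.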

Comment 2 Jan Friesse 2012-11-29 13:27:33 UTC
From logs:
The network interface is down.

Are you running NetworkManager (NM)? If so, please turn it off and use a static configuration. Corosync has really big problems if the interface is shut down (at the very least, the routing table changes a lot).
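
On a RHEL 6 node, turning NetworkManager off in favour of the classic network service might look like the sketch below. The helper names run and disable_network_manager and the DRY_RUN switch are made up for illustration; with DRY_RUN=1 (the default here) the script only prints what it would do.

```shell
#!/bin/sh
# Sketch: take NetworkManager out of the picture on a RHEL 6 node.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 as root to run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

disable_network_manager() {
    run service NetworkManager stop      # stop the running daemon
    run chkconfig NetworkManager off     # keep it from starting at boot
    run chkconfig network on             # enable the classic network service
    run service network restart          # bring interfaces up from the ifcfg files
}

disable_network_manager
```

The static configuration itself lives in the interface's ifcfg file (for example with BOOTPROTO=static and NM_CONTROLLED=no); the actual addresses depend on the site and are not shown here.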

There is a somewhat better explanation here: https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface

Fixing the ifdown problem in general is on the TODO list, but even though it is fairly high priority, there are bugs with even higher priority.

I will keep this BZ open as a TODO item.

Comment 3 Jan Friesse 2012-12-04 08:17:41 UTC
*** Bug 883080 has been marked as a duplicate of this bug. ***

Comment 6 Jan Friesse 2013-08-05 08:10:45 UTC
*** Bug 989934 has been marked as a duplicate of this bug. ***

Comment 10 Jan Friesse 2015-08-11 14:31:23 UTC
A proper solution to this bug would mean changing a huge part of very sensitive code. The bug also has well-known causes and a workaround (don't test cluster failover with ifdown, and don't use NetworkManager), so I am closing it as WONTFIX.
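
One way to test failover without taking the interface down is to drop the totem traffic with a firewall rule instead, so the interface and its routes stay in place. The sketch below is an assumption, not something prescribed in this bug: the function name totem_block_rules is made up, and the port 5498 is taken from the mcastport in the reporter's corosync.conf. It only prints iptables commands and applies nothing.

```shell
#!/bin/sh
# Dry-run sketch: print iptables rules that drop corosync totem traffic.
# Nothing is applied here; pipe the output to "sh" as root to apply it.
# Usage: totem_block_rules ACTION MCASTPORT   (ACTION: A to add, D to delete)
totem_block_rules() {
    action=$1
    port=$2
    # corosync uses mcastport and mcastport - 1, so both ports are covered.
    low=$((port - 1))
    for chain in INPUT OUTPUT; do
        echo "iptables -$action $chain -p udp -m multiport --dports $low,$port -j DROP"
    done
}

totem_block_rules A 5498   # rules that isolate the node
totem_block_rules D 5498   # rules that restore traffic
```

Because the rules are only printed, they can be reviewed before use, and the matching delete rules undo the test once the cluster's reaction has been observed.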