Bug 881694 - corosync process is heavy load, deadlocks in plug/unplug network cable test
Summary: corosync process is heavy load, deadlocks in plug/unplug network cable test
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 883080 989934
Depends On:
Blocks:
Reported: 2012-11-29 11:17 UTC by Shining
Modified: 2019-02-15 13:31 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-11 14:31:23 UTC
Target Upstream Version:


Attachments
corosync log files (1.32 MB, application/x-zip-compressed)
2012-11-29 11:17 UTC, Shining

Description Shining 2012-11-29 11:17:59 UTC
Created attachment 654195
corosync log files

Description of problem:


Version-Release number of selected component (if applicable):
[root@gcluster74 ~]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.2 (Santiago)
Release:        6.2
Codename:       Santiago

[gbase@gcluster77 ~]$ rpm -qa | grep corosync
corosync-1.4.1-4.el6.x86_64
corosynclib-1.4.1-4.el6.x86_64

/etc/corosync/corosync.conf
--------------
totem {
        version: 2
        secauth: on
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.9.74
                mcastaddr: 226.94.1.9
                mcastport: 5498
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/corosync.log
        debug: on   ## on only on node 74; off on the other nodes
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}
--------------

How reproducible:
Four cluster nodes, with IP addresses 192.168.9.71 through 192.168.9.74.
Unplug the network cable on one or two nodes, wait a few seconds, then plug the cable back in.

Steps to Reproduce:
1. Unplug the network cable.
2. Wait a few seconds.
3. Plug the network cable back in.
  
Actual results:
The cluster node is stuck in an infinite loop of Gather state 11.

Expected results:
The cluster node returns to a consistent state.


Additional info:
See the attached log files.
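
To gauge how often a node re-enters the Gather state, the log can be searched for the corresponding debug message. The following is a minimal sketch, not taken from the attached logs: it assumes the message reads "entering GATHER state from 11", and the function name count_gather_11 is hypothetical.

```shell
#!/bin/sh
# Count log lines that report entering the Gather state with the value 11.
# ASSUMPTION: the debug log words this as "entering GATHER state from 11";
# adjust the pattern if the actual log lines differ.
count_gather_11() {
    logfile=$1
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'entering GATHER state from 11' "$logfile"
}

# Example (log path as set in the logging section of corosync.conf):
#   count_gather_11 /var/log/corosync.log
```

A count that keeps growing while the cable is plugged in would be consistent with the loop described under "Actual results". Debug logging is only enabled on node 74 in this setup, so the other nodes may not log the message at all.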

Comment 2 Jan Friesse 2012-11-29 13:27:33 UTC
From logs:
The network interface is down.

Are you running NetworkManager (NM)? If so, please turn it off and use a static configuration. Corosync has really big problems if the interface is shut down (at the very least, the routing table changes A LOT).

There is a somewhat better explanation here: https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface

Fixing the ifdown problem in general is on the TODO list, but even though it is quite high priority, there are bugs with even higher priority.

I will keep this BZ open as a TODO.
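
One way to check whether an interface follows the advice above (no NetworkManager, static configuration) is to inspect its ifcfg file. This is a rough sketch under assumptions not stated in this bug: RHEL 6 style ifcfg files with NM_CONTROLLED and BOOTPROTO keys, and a hypothetical helper name ifcfg_is_static_unmanaged.

```shell
#!/bin/sh
# Return 0 if the given ifcfg file opts out of NetworkManager and declares a
# static address setup, 1 otherwise.
# ASSUMPTION: RHEL 6 style ifcfg files; a missing key counts as "not OK".
ifcfg_is_static_unmanaged() {
    cfg=$1
    # NM_CONTROLLED=no (optionally double-quoted, any case).
    grep -Eiq '^NM_CONTROLLED="?no"?[[:space:]]*$' "$cfg" || return 1
    # BOOTPROTO=none or BOOTPROTO=static, i.e. no DHCP.
    grep -Eiq '^BOOTPROTO="?(none|static)"?[[:space:]]*$' "$cfg" || return 1
    return 0
}

# Example (eth0 is a placeholder interface name):
#   ifcfg_is_static_unmanaged /etc/sysconfig/network-scripts/ifcfg-eth0 \
#       && echo "static, not managed by NM"
```

On RHEL 6 the NetworkManager service itself is typically stopped with `service NetworkManager stop` and kept off across reboots with `chkconfig NetworkManager off`.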

Comment 3 Jan Friesse 2012-12-04 08:17:41 UTC
*** Bug 883080 has been marked as a duplicate of this bug. ***

Comment 6 Jan Friesse 2013-08-05 08:10:45 UTC
*** Bug 989934 has been marked as a duplicate of this bug. ***

Comment 10 Jan Friesse 2015-08-11 14:31:23 UTC
A proper fix for this bug would mean changing a huge part of very sensitive code. The bug also has well-known causes and a workaround (don't test cluster failover with ifdown and don't use NetworkManager), so I am closing it as WONTFIX.
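
As an alternative to ifdown for failover tests, corosync traffic can be dropped with firewall rules while the interface stays up. The sketch below is an assumption-laden illustration, not a method taken from this bug: it only prints the iptables commands, it assumes the UDP port pair mcastport and mcastport - 1 described in corosync.conf(5), and the function name corosync_block_rules is hypothetical.

```shell
#!/bin/sh
# Print (do not run) iptables commands that drop corosync totem traffic.
#   $1: -A to add the DROP rules, -D to delete them again
#   $2: mcastport from corosync.conf (5498 in the reporter's configuration)
# ASSUMPTION: corosync uses UDP ports mcastport - 1 and mcastport.
corosync_block_rules() {
    action=$1
    port=$2
    lo=$((port - 1))
    for chain in INPUT OUTPUT; do
        echo "iptables $action $chain -p udp --dport $lo:$port -j DROP"
    done
}

# Example: review the output first, then run it as root on the node to isolate.
#   corosync_block_rules -A 5498    # start the simulated outage
#   corosync_block_rules -D 5498    # end it
```

Printing the commands rather than executing them keeps the sketch side-effect free; the rules should be reviewed against the local firewall setup before they are applied.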

