881694 – corosync process is heavy load, deadlocks in plug/unplug network cable test

Summary: corosync process is heavy load, deadlocks in plug/unplug network cable test

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	corosync
Sub Component:
Version:	6.3
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	883080 989934
Depends On:
Blocks:

Reported:	2012-11-29 11:17 UTC by Shining
Modified:	2019-02-15 13:31 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-08-11 14:31:23 UTC
Target Upstream Version:
Embargoed:

Attachments
corosync log files (1.32 MB, application/x-zip-compressed) 2012-11-29 11:17 UTC, Shining

Description Shining 2012-11-29 11:17:59 UTC

Created attachment 654195
corosync log files

Description of problem:


Version-Release number of selected component (if applicable):
[root@gcluster74 ~]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.2 (Santiago)
Release:        6.2
Codename:       Santiago

[gbase@gcluster77 ~]$ rpm -qa | grep corosync
corosync-1.4.1-4.el6.x86_64
corosynclib-1.4.1-4.el6.x86_64

/etc/corosync/corosync.conf
--------------
totem {
        version: 2
        secauth: on
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.9.74
                mcastaddr: 226.94.1.9
                mcastport: 5498
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/corosync.log
        # debug is on only on node 74; it is off on the other nodes
        debug: on
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}
--------------

How reproducible:
Four cluster nodes, with IPs 192.168.9.71 through 192.168.9.74.
Unplug the network cable on one or two nodes, wait a few seconds, then plug the cable back in.

Steps to Reproduce:
1. Unplug the network cable.
2. Wait a few seconds.
3. Plug the network cable back in.
  
Actual results:
The cluster node gets stuck in an infinite loop of Gather state 11.

Expected results:
The cluster nodes return to a consistent state.


Additional info:
See the attached log files.
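
One way to confirm the Gather loop described under "Actual results" is to count how often the log reports entry into the GATHER state, grouped by the numeric code. The following is only a minimal sketch: the message text matched by the regular expression is an assumption about how corosync 1.4 words this debug message, and the sample lines are hypothetical, not taken from the attached logs.

```python
import re
from collections import Counter

# Assumed message wording; check it against the real log before relying on it.
GATHER_RE = re.compile(r"entering GATHER state from (\d+)")


def count_gather_entries(lines):
    """Count GATHER-state entries per numeric code found in the log lines."""
    counts = Counter()
    for line in lines:
        match = GATHER_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical sample lines for illustration only.
sample = [
    "Nov 29 11:00:01 corosync [TOTEM ] entering GATHER state from 11.",
    "Nov 29 11:00:02 corosync [TOTEM ] entering GATHER state from 11.",
    "Nov 29 11:00:03 corosync [TOTEM ] entering OPERATIONAL state.",
]

print(count_gather_entries(sample))
```

To scan a real log, pass an open file object (for example the logfile path from the configuration above) instead of the sample list. A count for code 11 that keeps growing after the cable is plugged back in would match the reported behaviour.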

Comment 2 Jan Friesse 2012-11-29 13:27:33 UTC

From logs:
The network interface is down.

Are you running NetworkManager (NM)? If so, please turn it off and use a static network configuration. Corosync has serious problems when an interface is shut down (at the very least, the routing table changes a lot).

There is a somewhat better explanation here: https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface

Fixing the ifdown problem in general is on the TODO list, but even though it is fairly high priority, there are bugs with even higher priority.

I will keep this BZ open as a TODO.
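
The advice above (NetworkManager off, static configuration) can be checked per interface. The sketch below is not part of the original report. It assumes the RHEL 6 convention of an ifcfg-style file with `NM_CONTROLLED` and `BOOTPROTO` keys, assumes an interface is NetworkManager-controlled unless `NM_CONTROLLED=no` is set, and treats `BOOTPROTO` values `none` or `static` as static. The device name `eth0` in the sample is hypothetical; the address is the one from the reporter's `bindnetaddr`.

```python
def parse_ifcfg(text):
    """Parse KEY=value lines from an ifcfg-style file into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def is_static_without_nm(settings):
    """Return True if the interface is static and not NetworkManager-controlled.

    Assumption of this sketch: a missing NM_CONTROLLED key counts as "yes".
    """
    nm_off = settings.get("NM_CONTROLLED", "yes").lower() == "no"
    static = settings.get("BOOTPROTO", "").lower() in ("none", "static")
    return nm_off and static


# Hypothetical file content for illustration only.
sample_ifcfg = """
DEVICE=eth0
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPADDR=192.168.9.74
"""

print(is_static_without_nm(parse_ifcfg(sample_ifcfg)))
```

This only inspects the configuration file; it does not tell you whether the NetworkManager service is currently running.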

Comment 3 Jan Friesse 2012-12-04 08:17:41 UTC

*** Bug 883080 has been marked as a duplicate of this bug. ***

Comment 6 Jan Friesse 2013-08-05 08:10:45 UTC

*** Bug 989934 has been marked as a duplicate of this bug. ***

Comment 10 Jan Friesse 2015-08-11 14:31:23 UTC

A proper fix for this bug would mean changing a large part of very sensitive code. The bug also has well-known causes and a workaround (don't test cluster failover with ifdown and don't use NetworkManager), so I am closing it as WONTFIX.
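
The comment says what not to do but does not name an alternative. One commonly used approach, which is an assumption of this note and not something stated in this bug, is to leave the interface up and drop corosync's traffic with a firewall rule instead. The sketch below only builds the `iptables` command lines and does not execute them. It assumes corosync uses UDP ports `mcastport` and `mcastport - 1`, and the helper name `corosync_block_rules` is hypothetical.

```python
def corosync_block_rules(mcastport, action="-A"):
    """Build iptables commands that drop corosync UDP traffic in both directions.

    action is "-A" to append the rules or "-D" to delete them again.
    """
    # Assumed port pair: mcastport and mcastport - 1, expressed as a range.
    ports = "{}:{}".format(mcastport - 1, mcastport)
    return [
        ["iptables", action, chain, "-p", "udp", "--dport", ports, "-j", "DROP"]
        for chain in ("INPUT", "OUTPUT")
    ]


# 5498 is the mcastport from the reporter's corosync.conf.
for cmd in corosync_block_rules(5498):
    print(" ".join(cmd))
```

Running the printed commands as root on one node should isolate it from the cluster without changing the interface state or the routing table; calling the helper with `action="-D"` yields the matching commands to remove the rules.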