599327 – [RFE] - display diagnostic message in log and exit when local NIC fault or firewall enabled prevents totem from forming a cluster



Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	corosync
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	600479 650033 (view as bug list)
Depends On:
Blocks:	560700

Reported:	2010-06-03 06:34 UTC by Fabio Massimo Di Nitto
Modified:	2011-12-06 11:49 UTC (History)
CC List:	10 users (show)
Fixed In Version:	corosync-1.4.0-1.el6
Doc Type:	Enhancement
Doc Text:	Cause: Multicast traffic is blocked. Consequence: The cluster partitions into several smaller clusters. Each partition loses quorum and never remerges. The customer has no easy way to find out that this situation has occurred. Change: Print a diagnostic warning that the node cannot exit the GATHER state. Result: The message "[TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly." is printed. In addition, the runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure key is set to 1.
Clone Of:
Environment:
Last Closed:	2011-12-06 11:49:55 UTC
Target Upstream Version:
Embargoed:

Attachments
Proposed patch (4.45 KB, patch) 2010-12-02 14:07 UTC, Jan Friesse
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure patch (1.86 KB, patch) 2011-01-10 13:50 UTC, Jan Friesse
Enhancement of previous patches (1.08 KB, patch) 2011-05-04 15:28 UTC, Jan Friesse

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:1515	0	normal	SHIPPED_LIVE	corosync bug fix and enhancement update	2011-12-06 00:38:47 UTC

Description Fabio Massimo Di Nitto 2010-06-03 06:34:39 UTC

An N-node RHEL 6 cluster (where N > 2; I didn't test with 2 and I don't think it applies there either) running the latest packages (corosync 1.2.3-1 and cluster-3.0.12-3).

service iptables stop <- to allow multicast traffic

service cman start -> OK

"forget" to chkconfig iptables off

chkconfig cman on

fence_node nodeY (one node at random really)

Once nodeY attempts to start corosync, the cluster partitions into several smaller clusters. Each partition loses quorum and never remerges.

In my specific test case N was set to 32 and Y to 1.

While it is clearly a sysadmin error NOT to turn off iptables (or configure them properly), it is still highly undesirable that the cluster should die this way. A node that cannot rejoin the cluster properly should not cause the other nodes to stop operating properly.

Comment 6 Steven Dake 2010-09-30 16:26:55 UTC

*** Bug 600479 has been marked as a duplicate of this bug. ***

Comment 7 Steven Dake 2010-09-30 16:29:42 UTC

Honza,

We don't need to solve the Byzantine problem (that can't be solved with totem). We are more interested in tracking down a local NIC fault or an enabled firewall that blocks traffic, which leaves the software running without printing any diagnostic information.

Please note there is some interesting discussion about this problem in Bug 600479 which is a dup of this bz.

Comment 8 Jan Friesse 2010-12-02 14:07:07 UTC

Created attachment 464254 [details]
Proposed patch

Comment 9 Steven Dake 2010-12-23 16:01:58 UTC

*** Bug 650033 has been marked as a duplicate of this bug. ***

Comment 10 Steven Dake 2011-01-04 22:34:35 UTC

Honza,

Chris Feist needs some diagnostic information in the objdb when this condition occurs.  It is blocking his progress on another bz related to watchdog integration.

Could you make a second patch and add a diag as follows:

runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1

when this condition is present

or =0 when not present.

Thanks!

Comment 11 Jan Friesse 2011-01-10 13:50:21 UTC

Created attachment 472592 [details]
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure patch

Diagnostic information added.

Comment 13 Jaroslav Kortus 2011-03-28 15:39:55 UTC

(10:23:51) [root@z4:~]$ iptables -I INPUT -p tcp --dport 22 -j ACCEPT
(10:23:55) [root@z4:~]$ iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
(10:23:59) [root@z4:~]$ iptables -A INPUT -j DROP
(10:24:08) [root@z4:~]$ service cman start

Mar 28 10:24:57 z4 corosync[2039]:   [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

(10:26:05) [root@z4:~]$ corosync-objctl -a | grep firewall_enabled_or_nic_failure
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1

Behaves as expected, marking as verified.
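
The check above can also be scripted. The following is a minimal sketch, not part of the patch: it assumes the `key=value` output format of `corosync-objctl -a` shown in this comment, and the helper names `parse_objdb_dump` and `firewall_or_nic_fault` are hypothetical.

```python
# Key name taken from this bug report.
FLAG_KEY = "runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure"


def parse_objdb_dump(text):
    """Parse `key=value` lines into a dict; lines without '=' are skipped."""
    entries = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            entries[key] = value
    return entries


def firewall_or_nic_fault(dump_text):
    """Return True if the flag is 1, False if it is 0, None if the key is absent."""
    value = parse_objdb_dump(dump_text).get(FLAG_KEY)
    if value is None:
        return None
    return value.strip() == "1"
```

On a cluster node, the input text could come from `subprocess.run(["corosync-objctl", "-a"], capture_output=True, text=True).stdout`. Returning None for a missing key keeps "flag not set" distinct from "key not present", which may matter on builds without the patch.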

Comment 14 Corey Marthaler 2011-04-07 19:06:55 UTC

I'm seeing this message sometimes during the normal startup process. Maybe the tunable should be adjusted.

Apr  7 13:44:07 taft-03 corosync[21359]:   [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Apr  7 13:44:17 taft-03 corosync[21359]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr  7 13:44:17 taft-03 corosync[21359]:   [CMAN  ] quorum regained, resuming activity
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] This node is within the primary component and will provide service.
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[1]: 2
Apr  7 13:44:17 taft-03 corosync[21359]:   [CMAN  ] quorum lost, blocking activity
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[1]: 2
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[2]: 2 3
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[2]: 2 3
Apr  7 13:44:17 taft-03 corosync[21359]:   [CMAN  ] quorum regained, resuming activity
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] This node is within the primary component and will provide service.
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[3]: 2 3 4
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[3]: 2 3 4
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[4]: 1 2 3 4
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[4]: 1 2 3 4
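
To find other occurrences of this pattern in a log, one could scan for the fault message followed shortly by a formed membership. This is a rough sketch only: the 30-second window, the `year` parameter (syslog lines carry no year), and the function names are assumptions made for illustration, not values from corosync.

```python
from datetime import datetime, timedelta

FAULT_TEXT = "Totem is unable to form a cluster"
FORMED_TEXT = "a new membership was formed"


def parse_time(line, year=2011):
    """Parse the leading syslog timestamp, e.g. 'Apr  7 13:44:07'."""
    month, day, clock = line.split(None, 3)[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def transient_faults(lines, window_seconds=30, year=2011):
    """Return timestamps of fault messages that were followed by a
    'new membership was formed' message within the window."""
    results = []
    pending = None  # timestamp of the most recent unresolved fault message
    for line in lines:
        if FAULT_TEXT in line:
            pending = parse_time(line, year)
        elif FORMED_TEXT in line and pending is not None:
            if parse_time(line, year) - pending <= timedelta(seconds=window_seconds):
                results.append(pending)
            pending = None
    return results
```

Applied to the excerpt above, the fault message at 13:44:07 is followed by a new membership at 13:44:17, so it would be reported as transient.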

Comment 25 Jan Friesse 2011-05-04 15:28:21 UTC

Created attachment 496821 [details]
Enhancement of previous patches

memb_state_gather_enter increases stats.continuous_gather only if the
previous state was also gather. This should happen only if multicast is
not working properly (a local firewall in most cases), and not when many
nodes join at the same time.
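
The counting rule described here can be modelled as follows. This is a simplified sketch, not the corosync source: the class `GatherMonitor`, the state enum, and the threshold value of 3 are assumptions for illustration. Only the rule itself (count a gather entry only when the previous state was also gather, and reset on entering operational) comes from this bug.

```python
from enum import Enum


class MembState(Enum):
    OPERATIONAL = "operational"
    GATHER = "gather"
    COMMIT = "commit"


# Hypothetical threshold; the real constant is not stated in this bug.
MAX_CONTINUOUS_GATHER = 3


class GatherMonitor:
    def __init__(self, threshold=MAX_CONTINUOUS_GATHER):
        self.state = MembState.OPERATIONAL
        self.continuous_gather = 0
        self.firewall_enabled_or_nic_failure = 0
        self.threshold = threshold

    def enter(self, new_state):
        if new_state is MembState.GATHER:
            # Count only gather -> gather transitions. Entering gather from
            # another state (e.g. while many nodes join) is not counted.
            if self.state is MembState.GATHER:
                self.continuous_gather += 1
        elif new_state is MembState.OPERATIONAL:
            # Reset the counter once a membership has formed.
            self.continuous_gather = 0
        self.firewall_enabled_or_nic_failure = int(
            self.continuous_gather > self.threshold
        )
        self.state = new_state
```

In this model, a node that alternates between gather and commit while many nodes join never raises the flag, whereas a node that re-enters gather repeatedly without leaving it does once the threshold is exceeded.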

Comment 26 Jan Friesse 2011-05-05 09:06:37 UTC

Patch committed upstream as 61d83cd719bcc93a78eb5c718a138b96c325cc3e

Also, one extra note. Originally we were thinking about 3 diagnostics, and it may seem that only one is implemented. They were:
- Continuous gather only from 1 state when firewall is on (this is implemented)
- count message_handler_memb_join since last OPERATIONAL state, if zero, increase counter - This shouldn't be needed, because it is covered explicitly by checking whether we are gathering from a different state
- reset counter on operational enter - was in original patch
- increase the constant for timeouts - This changes only one thing: the timeout before the message is displayed and the objdb key is set. It doesn't scale to larger numbers of nodes.

Corey, can you please retest with patch applied?

Comment 27 Sayan Saha 2011-05-26 17:07:34 UTC

Granting pm_ack for inclusion in RHEL 6.2.

Comment 33 Jan Friesse 2011-10-31 07:58:46 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
Multicast traffic is blocked.

Consequence
The cluster partitions into several smaller clusters. Each partition loses quorum and never remerges. The customer has no easy way to find out that this situation has occurred.

Change
Print a diagnostic warning that the node cannot exit the
GATHER state.

Result
The message "[TOTEM ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of this
message is that the local firewall is configured improperly." is printed.

In addition, the runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure key is set to 1.

Comment 34 errata-xmlrpc 2011-12-06 11:49:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html
