Bug 599327 - [RFE] - display diagnostic message in log and exit when local NIC fault or firewall enabled prevents totem from forming a cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 600479 650033
Depends On:
Blocks: 560700
Reported: 2010-06-03 06:34 UTC by Fabio Massimo Di Nitto
Modified: 2011-12-06 11:49 UTC (History)

Fixed In Version: corosync-1.4.0-1.el6
Doc Type: Enhancement
Doc Text:
Cause: Multicast traffic is blocked.
Consequence: The cluster partitions into several smaller clusters. Each partition loses quorum and never remerges. The customer has no easy way to find out that this situation has occurred.
Change: Print a diagnostic warning that the node can't exit the GATHER state.
Result: The message "[TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly." is printed. The runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure key is also set to 1.
Clone Of:
Environment:
Last Closed: 2011-12-06 11:49:55 UTC


Attachments
Proposed patch (4.45 KB, patch)
2010-12-02 14:07 UTC, Jan Friesse
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure patch (1.86 KB, patch)
2011-01-10 13:50 UTC, Jan Friesse
Enhancement of previous patches (1.08 KB, patch)
2011-05-04 15:28 UTC, Jan Friesse


Links
System: Red Hat Product Errata
ID: RHBA-2011:1515
Priority: normal
Status: SHIPPED_LIVE
Summary: corosync bug fix and enhancement update
Last Updated: 2011-12-06 00:38:47 UTC

Description Fabio Massimo Di Nitto 2010-06-03 06:34:39 UTC
An N-node RHEL 6 cluster (where N > 2; I didn't test with 2 and I don't think it applies there either) running the latest packages (corosync 1.2.3-1 and cluster-3.0.12-3).

service iptables stop <- to allow multicast traffic

service cman start -> OK

"forget" to chkconfig iptables off

chkconfig cman on

fence_node nodeY (one node at random really)

Once nodeY attempts to start corosync, the cluster partitions into several smaller clusters. Each partition loses quorum and never remerges.

In my specific test case N was set to 32 and Y to 1.

While it is clearly a sysadmin error NOT to turn off iptables (or configure them properly), it is still highly undesirable that the cluster should die this way. A node that cannot rejoin the cluster properly should not cause the other nodes to stop operating properly.
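
For readers who hit the same condition, an alternative to stopping iptables is to leave the firewall on and explicitly allow totem traffic. The following is only a sketch and is not part of the original report: it assumes corosync's default UDP ports (mcastport 5405 plus mcastport - 1), and the helper name print_corosync_fw_rule is hypothetical. It prints the rule instead of applying it, so it can be reviewed before being run as root.

```shell
# Hypothetical helper: print (do not apply) an iptables rule that accepts
# corosync totem traffic. Assumes UDP and the default mcastport of 5405;
# corosync also uses mcastport - 1.
print_corosync_fw_rule() {
    mcastport=${1:-5405}
    echo "iptables -I INPUT -p udp -m multiport --dports $((mcastport - 1)),${mcastport} -j ACCEPT"
}

print_corosync_fw_rule
```

Site policy may call for tighter rules (for example, restricting the source network), so treat the printed rule as a starting point only.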

Comment 6 Steven Dake 2010-09-30 16:26:55 UTC
*** Bug 600479 has been marked as a duplicate of this bug. ***

Comment 7 Steven Dake 2010-09-30 16:29:42 UTC
Honza,

We don't need to solve the Byzantine problem (that can't be solved with totem). We are more interested in tracking down a fault of the local NIC, or an enabled firewall blocking traffic, which results in the software running but not printing any diagnostic information.

Please note there is some interesting discussion about this problem in Bug 600479 which is a dup of this bz.

Comment 8 Jan Friesse 2010-12-02 14:07:07 UTC
Created attachment 464254
Proposed patch

Comment 9 Steven Dake 2010-12-23 16:01:58 UTC
*** Bug 650033 has been marked as a duplicate of this bug. ***

Comment 10 Steven Dake 2011-01-04 22:34:35 UTC
Honza,

Chris Feist needs some diagnostic information in the objdb when this condition occurs.  It is blocking his progress on another bz related to watchdog integration.

Could you make a second patch and add a diag as follows:

runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1

when this condition is present

or =0 when not present.

Thanks!

Comment 11 Jan Friesse 2011-01-10 13:50:21 UTC
Created attachment 472592
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure patch

Diagnostic information added.

Comment 13 Jaroslav Kortus 2011-03-28 15:39:55 UTC
(10:23:51) [root@z4:~]$ iptables -I INPUT -p tcp --dport 22 -j ACCEPT
(10:23:55) [root@z4:~]$ iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
(10:23:59) [root@z4:~]$ iptables -A INPUT -j DROP
(10:24:08) [root@z4:~]$ service cman start

Mar 28 10:24:57 z4 corosync[2039]:   [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

(10:26:05) [root@z4:~]$ corosync-objctl -a | grep firewall_enabled_or_nic_failure
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1

Behaves as expected, marking as verified.
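
The manual check above could be wrapped for monitoring scripts. This is a minimal sketch, assuming the key is printed exactly as in the output shown in this comment; the function name totem_fault_flagged is hypothetical. It reads `corosync-objctl -a` output on stdin, so it can be exercised without a running cluster.

```shell
# Hypothetical helper: reads `corosync-objctl -a` output on stdin and
# returns 0 only if the diagnostic key is present and set to 1.
totem_fault_flagged() {
    grep -q '^runtime\.totem\.pg\.mrp\.srp\.firewall_enabled_or_nic_failure=1$'
}

# Example use on a live node (requires corosync to be running):
#   corosync-objctl -a | totem_fault_flagged && echo "firewall or NIC fault flagged"
```

Reading from stdin keeps the check independent of how the object database is dumped, which is a design choice of this sketch rather than anything prescribed by the patch.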

Comment 14 Corey Marthaler 2011-04-07 19:06:55 UTC
I'm seeing this message sometimes during the normal startup process. Maybe the tunable should be adjusted.

Apr  7 13:44:07 taft-03 corosync[21359]:   [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Apr  7 13:44:17 taft-03 corosync[21359]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr  7 13:44:17 taft-03 corosync[21359]:   [CMAN  ] quorum regained, resuming activity
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] This node is within the primary component and will provide service.
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[1]: 2
Apr  7 13:44:17 taft-03 corosync[21359]:   [CMAN  ] quorum lost, blocking activity
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[1]: 2
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[2]: 2 3
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[2]: 2 3
Apr  7 13:44:17 taft-03 corosync[21359]:   [CMAN  ] quorum regained, resuming activity
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] This node is within the primary component and will provide service.
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[3]: 2 3 4
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[3]: 2 3 4
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[4]: 1 2 3 4
Apr  7 13:44:17 taft-03 corosync[21359]:   [QUORUM] Members[4]: 1 2 3 4

Comment 25 Jan Friesse 2011-05-04 15:28:21 UTC
Created attachment 496821
Enhancement of previous patches

memb_state_gather_enter increases stats.continuous_gather only if the
previous state was also gather. This should happen only if multicast is
not working properly (local firewall in most cases) and not if many
nodes join at one time.
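
The counting rule described above can be modelled roughly as follows. This is a simplified sketch and not the corosync source: the function names enter_state and fault_suspected are hypothetical, the threshold value is illustrative only, and the reset on entering the operational state follows the original patch as described in this bug.

```shell
# Simplified model of the counter; NOT the corosync implementation.
prev_state=""
continuous_gather=0
threshold=3   # illustrative value only, not taken from the patch

enter_state() {
    if [ "$1" = "gather" ] && [ "$prev_state" = "gather" ]; then
        # gather re-entered from gather: multicast is probably not arriving
        continuous_gather=$((continuous_gather + 1))
    elif [ "$1" = "operational" ]; then
        # a membership formed, so the counter starts over
        continuous_gather=0
    fi
    prev_state=$1
}

# Returns 0 once the node has been stuck re-entering gather too often.
fault_suspected() {
    [ "$continuous_gather" -gt "$threshold" ]
}

# Example: a normal join cycle never increments the counter:
#   for s in gather commit recovery operational; do enter_state "$s"; done
```

In this model, entering gather from any other state leaves the counter alone, which is why a burst of nodes joining at once no longer triggers the warning.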

Comment 26 Jan Friesse 2011-05-05 09:06:37 UTC
Patch committed upstream as 61d83cd719bcc93a78eb5c718a138b96c325cc3e

Also one extra note. Originally we were thinking about 3 diagnostics, and it may seem that only one is implemented. They were:
- Continuous gather only from 1 state when firewall is on (this is implemented)
- count message_handler_memb_join since last OPERATIONAL state, if zero, increase counter - This shouldn't be needed because it's done explicitly because we are gathering from a different state
- reset counter on operational enter - was in original patch
- increase the constant for timeouts - This changes only one thing, and that is the timeout before the message is displayed/objdb key is set. Doesn't scale for a bigger number of nodes.

Corey, can you please retest with patch applied?

Comment 27 Sayan Saha 2011-05-26 17:07:34 UTC
Granting pm_ack for inclusion in RHEL 6.2.

Comment 33 Jan Friesse 2011-10-31 07:58:46 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause
Multicast traffic is blocked.

Consequence
The cluster partitions into several smaller clusters. Each partition loses quorum and never remerges. The customer has no easy way to find out that this situation has occurred.

Change
Print a diagnostic warning that the node can't exit the
GATHER state.

Result
The message "[TOTEM ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of this
message is that the local firewall is configured improperly." is printed.

The runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure key is also set to 1.

Comment 34 errata-xmlrpc 2011-12-06 11:49:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html

