Red Hat Bugzilla – Bug 599327
[RFE] - display diagnostic message in log and exit when local NIC fault or firewall enabled prevents totem from forming a cluster
Last modified: 2011-12-06 06:49:55 EST
N-node (where N > 2; I didn't test with 2 and I don't think it applies there either) RHEL 6 cluster running the latest packages (corosync 1.2.3-1 and cluster-3.0.12-3).

service iptables stop   <- to allow multicast traffic
service cman start      -> OK
"Forget" to chkconfig iptables off
chkconfig cman on
fence_node nodeY        (one node chosen at random, really)

Once nodeY attempts to start corosync, the cluster partitions into several smaller clusters. Each partition loses quorum and the partitions never remerge. In my specific test case N was 32 and Y was 1.

While it is clearly a sysadmin error NOT to turn off iptables (or configure it properly), it is still highly undesirable that the cluster should die this way. A node that cannot rejoin the cluster properly should not cause the other nodes to stop operating properly.
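For reference, a minimal sketch of iptables rules that would let totem traffic through without disabling the firewall entirely. It assumes the default corosync mcastport (5405; totem also uses mcastport - 1) and no earlier rule that drops the traffic first:

# allow IGMP so the node can join the multicast group
iptables -I INPUT -p igmp -j ACCEPT
# allow inbound multicast packets
iptables -I INPUT -m pkttype --pkt-type multicast -j ACCEPT
# allow totem UDP traffic on the (assumed) default ports 5404-5405
iptables -I INPUT -p udp -m udp --dport 5404:5405 -j ACCEPT
service iptables save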
*** Bug 600479 has been marked as a duplicate of this bug. ***
Honza, we don't need to solve the byzantine problem (that can't be solved with totem). We are more interested in tracking down the case where a fault of the local NIC, or an enabled firewall, blocks traffic and the software keeps running without printing any diagnostic information. Please note there is some interesting discussion about this problem in Bug 600479, which is a dup of this bz.
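As a rough illustration of those two failure modes, both can usually be spotted from the shell before totem logs anything (eth0 here is just an example interface name):

ip link show eth0        # is the NIC up, and does it carry the MULTICAST flag?
iptables -L INPUT -n -v  # is any rule dropping UDP or multicast traffic?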
Created attachment 464254 [details] Proposed patch
*** Bug 650033 has been marked as a duplicate of this bug. ***
Honza, Chris Feist needs some diagnostic information in the objdb when this condition occurs. It is blocking his progress on another bz related to watchdog integration. Could you make a second patch and add a diagnostic key as follows: runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1 when this condition is present, and =0 when it is not. Thanks!
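For illustration only, watchdog-style code could poll such a key roughly like this (a sketch that assumes the key ends up in the objdb under the name requested above and is readable through corosync-objctl, as in the verification below):

while true; do
    # 1 means totem is stuck in GATHER because of a firewall/NIC problem
    state=$(corosync-objctl -a | grep firewall_enabled_or_nic_failure | cut -d= -f2)
    if [ "$state" = "1" ]; then
        logger -t cluster-watchdog "totem cannot form a cluster: check local firewall/NIC"
    fi
    sleep 10
done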
Created attachment 472592 [details]
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure patch - adds the diagnostic information
(10:23:51) [root@z4:~]$ iptables -I INPUT -p tcp --dport 22 -j ACCEPT
(10:23:55) [root@z4:~]$ iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
(10:23:59) [root@z4:~]$ iptables -A INPUT -j DROP
(10:24:08) [root@z4:~]$ service cman start

Mar 28 10:24:57 z4 corosync[2039]: [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

(10:26:05) [root@z4:~]$ corosync-objctl -a | grep firewall_enabled_or_nic_failure
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1

Behaves as expected, marking as verified.
I'm seeing this message sometimes during the normal startup process. Maybe the tunable should be adjusted.

Apr 7 13:44:07 taft-03 corosync[21359]: [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Apr 7 13:44:17 taft-03 corosync[21359]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 7 13:44:17 taft-03 corosync[21359]: [CMAN ] quorum regained, resuming activity
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] This node is within the primary component and will provide service.
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[1]: 2
Apr 7 13:44:17 taft-03 corosync[21359]: [CMAN ] quorum lost, blocking activity
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[1]: 2
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[2]: 2 3
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[2]: 2 3
Apr 7 13:44:17 taft-03 corosync[21359]: [CMAN ] quorum regained, resuming activity
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] This node is within the primary component and will provide service.
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[3]: 2 3 4
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[3]: 2 3 4
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[4]: 1 2 3 4
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[4]: 1 2 3 4
Created attachment 496821 [details]
Enhancement of previous patches

memb_state_gather_enter now increases stats.continuous_gather only if the previous state was also gather. This should happen only if multicast is not working properly (a local firewall in most cases) and not when many nodes join at the same time.
Patch committed upstream as 61d83cd719bcc93a78eb5c718a138b96c325cc3e

Also one extra note. Originally we were thinking about 3 diagnostics, and it may seem that only one is implemented. They were:
- Continuous gather only from one state when the firewall is on - this is the one implemented.
- Count message_handler_memb_join calls since the last OPERATIONAL state and, if zero, increase the counter - this shouldn't be needed, because gathering from a different state is already handled explicitly, and resetting the counter on entering OPERATIONAL was in the original patch.
- Increase the constant for timeouts - this changes only one thing, the timeout before the message is displayed and the objdb key is set, and it doesn't scale for a bigger number of nodes.

Corey, can you please retest with the patch applied?
Granting pm_ack for inclusion in RHEL 6.2.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Multicast traffic is blocked.
Consequence: The cluster partitions into several smaller clusters. Each partition loses quorum and the partitions never remerge. The customer has no easy way to find out that this situation has happened.
Change: Print a diagnostic warning that the node cannot exit the GATHER state.
Result: The message "[TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly." is printed, and the runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure key is set to 1.
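For completeness, a quick way to check for this condition on an affected node (a sketch that assumes syslog goes to the usual /var/log/messages; the objctl check mirrors the verification above):

grep "Totem is unable to form a cluster" /var/log/messages
corosync-objctl -a | grep firewall_enabled_or_nic_failure    # =1 while the condition is present, =0 otherwise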
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1515.html