Bug 599327
Summary: [RFE] - display diagnostic message in log and exit when local NIC fault or firewall enabled prevents totem from forming a cluster

Product: Red Hat Enterprise Linux 6
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: low
Reporter: Fabio Massimo Di Nitto <fdinitto>
Assignee: Jan Friesse <jfriesse>
QA Contact: Cluster QE <mspqa-list>
CC: cfeist, cluster-maint, cmarthal, djansa, jkortus, mkelly, sdake, snagar, ssaha, syeghiay
Target Milestone: rc
Target Release: ---
Keywords: FutureFeature
Fixed In Version: corosync-1.4.0-1.el6
Doc Type: Enhancement
Doc Text:
Cause: Multicast traffic is blocked.
Consequence: The cluster partitions into several smaller clusters. Each partition loses quorum and never remerges. Customers have no easy way to detect that this situation has occurred.
Change: Print a diagnostic warning that the node cannot exit the GATHER state.
Result: The message "[TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly." is printed, and the runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure key is set to 1.
Last Closed: 2011-12-06 11:49:55 UTC
Bug Blocks: 560700
Description by Fabio Massimo Di Nitto, 2010-06-03 06:34:39 UTC
*** Bug 600479 has been marked as a duplicate of this bug. ***

Honza, we don't need to solve the Byzantine generals problem (which can't be solved with totem). We are more interested in tracking down a fault of the local NIC, or an enabled firewall blocking traffic, which results in the software running but printing no diagnostic information. Please note there is some interesting discussion about this problem in Bug 600479, which is a duplicate of this bz.

Created attachment 464254 [details]
Proposed patch
*** Bug 650033 has been marked as a duplicate of this bug. ***

Honza, Chris Feist needs some diagnostic information in the objdb when this condition occurs. It is blocking his progress on another bz related to watchdog integration. Could you make a second patch and add a diagnostic as follows: runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1 when this condition is present, or =0 when it is not. Thanks!

Created attachment 472592 [details]
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure patch

Diagnostic information added.
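For contrast with the verification below, which reproduces the fault by dropping all inbound traffic, a firewall that permits corosync's totem traffic avoids triggering the diagnostic. A minimal sketch, assuming the default mcastport of 5405 (corosync receives on mcastport and mcastport - 1, i.e. UDP 5404-5405):

```shell
# Allow corosync totem traffic before any DROP rule
# (assumes the default mcastport=5405 from corosync.conf).
iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT
# Also permit IGMP so multicast group membership can be negotiated.
iptables -I INPUT -p igmp -j ACCEPT
```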
(10:23:51) [root@z4:~]$ iptables -I INPUT -p tcp --dport 22 -j ACCEPT
(10:23:55) [root@z4:~]$ iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
(10:23:59) [root@z4:~]$ iptables -A INPUT -j DROP
(10:24:08) [root@z4:~]$ service cman start

Mar 28 10:24:57 z4 corosync[2039]: [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

(10:26:05) [root@z4:~]$ corosync-objctl -a | grep firewall_enabled_or_nic_failure
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=1

Behaves as expected, marking as verified.

I'm seeing this message sometimes during the normal startup process. Maybe the tunable should be adjusted.

Apr 7 13:44:07 taft-03 corosync[21359]: [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Apr 7 13:44:17 taft-03 corosync[21359]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 7 13:44:17 taft-03 corosync[21359]: [CMAN ] quorum regained, resuming activity
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] This node is within the primary component and will provide service.
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[1]: 2
Apr 7 13:44:17 taft-03 corosync[21359]: [CMAN ] quorum lost, blocking activity
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[1]: 2
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[2]: 2 3
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[2]: 2 3
Apr 7 13:44:17 taft-03 corosync[21359]: [CMAN ] quorum regained, resuming activity
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] This node is within the primary component and will provide service.
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[3]: 2 3 4
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[3]: 2 3 4
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[4]: 1 2 3 4
Apr 7 13:44:17 taft-03 corosync[21359]: [QUORUM] Members[4]: 1 2 3 4

Created attachment 496821 [details]
Enhancement of previous patches

memb_state_gather_enter now increases stats.continuous_gather only if the previous state was also GATHER. This should happen only when multicast is not working properly (a local firewall, in most cases), and not when many nodes join at the same time.
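In sketch form, the heuristic described above looks like this. Names, the state enum, and the threshold constant are illustrative, not the exact corosync symbols:

```c
#include <stdio.h>

/* Illustrative sketch of the continuous-gather heuristic; not the
 * actual corosync code. */
enum memb_state { OPERATIONAL, GATHER, COMMIT, RECOVERY };

#define MAX_NO_CONT_GATHER 3  /* assumed threshold before warning */

struct totem_stats {
    unsigned int continuous_gather;
    int firewall_enabled_or_nic_failure;
};

/* Called on every transition into the GATHER state. */
static void memb_state_gather_enter(struct totem_stats *stats,
                                    enum memb_state prev_state)
{
    /* Count only consecutive GATHER -> GATHER transitions: entering
     * GATHER from OPERATIONAL (e.g. many nodes joining at once) is
     * normal and must not trip the diagnostic. */
    if (prev_state == GATHER)
        stats->continuous_gather++;
    else
        stats->continuous_gather = 0;

    if (stats->continuous_gather > MAX_NO_CONT_GATHER) {
        fprintf(stderr, "[TOTEM ] Totem is unable to form a cluster "
                "because of an operating system or network fault.\n");
        stats->firewall_enabled_or_nic_failure = 1;
    }
}
```

Entering OPERATIONAL resets the counter, so a node that eventually forms a membership stops accumulating toward the warning.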
Patch committed upstream as 61d83cd719bcc93a78eb5c718a138b96c325cc3e

One extra note: originally we were thinking about three diagnostics, so it may seem that only one is implemented. They were:
- Continuous gather from only one state when the firewall is on (this is implemented).
- Count message_handler_memb_join calls since the last OPERATIONAL state, and increase a counter if zero. This shouldn't be needed, because it is handled explicitly by gathering from a different state; resetting the counter on entering OPERATIONAL was in the original patch.
- Increase the constant for timeouts. This changes only one thing, the timeout before the message is displayed and the objdb key is set, and it doesn't scale for a bigger number of nodes.

Corey, can you please retest with the patch applied?

Granting pm_ack for inclusion in RHEL 6.2.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Multicast traffic is blocked.
Consequence: The cluster partitions into several smaller clusters. Each partition loses quorum and never remerges. Customers have no easy way to detect that this situation has occurred.
Change: Print a diagnostic warning that the node cannot exit the GATHER state.
Result: The message "[TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly." is printed, and the runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure key is set to 1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html