Bug 990623

Summary: corosync does not notice network down
Product: Red Hat Enterprise Linux 6
Reporter: michal novacek <mnovacek>
Component: corosync
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
Severity: unspecified
Priority: unspecified
Version: 6.4
CC: ccaulfie, cluster-maint, dvossel, jfriesse, sdake
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-08-05 08:14:46 UTC
Bug Blocks: 991412
Attachments: relevant part of corosync log

Description michal novacek 2013-07-31 14:59:32 UTC
Created attachment 781194: relevant part of corosync log

Description of problem:
I have a pacemaker+corosync cluster of 3 virtual nodes with no services. I run,
at the same time on all of them, a script that drops all communication on the
INPUT and OUTPUT chains except ssh. Pacemaker never notices that the network is
unreachable and keeps reporting all the nodes online and the cluster quorate.
Corosync does seem to notice, though.


Version-Release number of selected component (if applicable):
corosync-1.4.1-15.el6_4.1.x86_64
pacemaker-1.1.8-7.el6.x86_64

How reproducible: always

Steps to Reproduce:
1. set up a corosync+pacemaker cluster
2. add a DROP rule to the INPUT and OUTPUT chains in iptables on all nodes at the
same time (see the sketch below)
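
A minimal sketch of the iptables commands that produce the ruleset shown in the
iptables-save output further down; the actual script was not attached, so this
is an assumption reconstructed from that output:

# keep ssh reachable, drop everything else in both directions
iptables -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
iptables -A INPUT -j DROP
iptables -A OUTPUT -p tcp -m tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT
iptables -A OUTPUT -j DROP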

Actual results: pacemaker does not notice anything wrong.

Expected results: pacemaker notices the network failure and each node forms its own inquorate partition.

Additional info:
Output from one of the cluster nodes:

# pcs status
Last updated: Wed Jul 31 16:43:25 2013
Last change: Wed Jul 31 15:09:27 2013 via crm_resource on virt-064
Stack: cman
Current DC: virt-070 - partition with quorum
Version: 1.1.8-7.el6-394e906
3 Nodes configured, unknown expected votes
1 Resources configured.


Online: [ virt-064 virt-065 virt-070 ]

Full list of resources:

 virt-fencing   (stonith:fence_xvm):    Stopped 

# iptables-save
# Generated by iptables-save v1.4.7 on Wed Jul 31 16:50:16 2013
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT 
-A INPUT -j DROP 
-A OUTPUT -p tcp -m tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT 
-A OUTPUT -j DROP 
COMMIT
# Completed on Wed Jul 31 16:50:16 2013

# cat /etc/cluster/cluster.conf 
<?xml version="1.0"?>
<cluster config_version="1" name="STSRHTS7932">
        <cman/>
        <totem token="3000"/>
        <fence_daemon clean_start="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="virt-064" nodeid="1" votes="1">
                        <fence>
                                <method name="pcmk-redirect">
                                        <device name="pcmk" port="virt-064"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="virt-065" nodeid="2" votes="1">
                        <fence>
                                <method name="pcmk-redirect">
                                        <device name="pcmk" port="virt-065"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="virt-070" nodeid="3" votes="1">
                        <fence>
                                <method name="pcmk-redirect">
                                        <device name="pcmk" port="virt-070"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_pcmk" name="pcmk"/>
        </fencedevices>
</cluster>

There are a lot of "Totem is unable to form a cluster because of an operating
system or network fault. The most common cause of this message is that the
local firewall is configured improperly." messages in corosync.log, so corosync
itself does seem to know about the problem.
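
For reference, a few commands that show what the membership layer itself thinks
(standard tools from the corosync 1.4 / cman stack on RHEL 6; suggested here as
a diagnostic, their output from the affected nodes was not captured):

# ring status as seen by corosync
corosync-cfgtool -s
# quorum information as seen by cman
cman_tool status
# per-node membership state
cman_tool nodes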

Comment 2 michal novacek 2013-08-02 11:38:45 UTC
The same behaviour happens with -j DROP on the INPUT chain only.

Comment 3 Jan Friesse 2013-08-05 08:14:46 UTC
Filtering only INPUT (or only OUTPUT) with iptables is unsupported. Also, at a minimum, localhost traffic must not be filtered; otherwise corosync does not work and is unable to create even a single-node membership. This is a configuration error, not a bug.
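
Following that reasoning, a hedged sketch of a rule set that simulates loss of
cluster connectivity while satisfying the constraints named above (loopback left
untouched, both directions filtered); this was not part of the original report:

# leave loopback alone so corosync can still form a single-node membership
iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
# keep ssh reachable
iptables -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -p tcp -m tcp --sport 22 -m state --state ESTABLISHED -j ACCEPT
# drop everything else, symmetrically on INPUT and OUTPUT
iptables -A INPUT -j DROP
iptables -A OUTPUT -j DROP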