Bug 1153818

Summary: Cluster fails to start when nodes that aren't in the cluster think they are and are running corosync
Product: Red Hat Enterprise Linux 7
Reporter: Chris Feist <cfeist>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: high
Version: 7.2
CC: ccaulfie, cfeist, cluster-maint, jfriesse, jkortus, mnovacek, phagara
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: corosync-2.4.5-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-31 19:54:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1205796
Attachments:
  * udpu: Drop packets from unlisted IPs (flags: none)
  * man: Enahnce block_unlisted_ips description (flags: none)

Description Chris Feist 2014-10-16 21:20:36 UTC
Description of problem:
Cluster fails to start when nodes that aren't in the cluster think they are and are running corosync

Version-Release number of selected component (if applicable):
corosync-2.3.3-2.el7.x86_64

How reproducible:
Always (as far as I can tell)

Steps to Reproduce:
1.  Create a cluster of 3 nodes (but don't start it)
2.  Take the corosync.conf from one of the nodes and copy it to another machine
3.  Start corosync on that machine (the one *not* in the cluster)
4.  Try to start corosync on one (or more) of the 3 nodes.

Actual results:
In /var/log/messages

Oct 16 13:22:22 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:24 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:25 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:27 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

Expected results:
Cluster should start properly and ignore the stray packets (it may log an error message so we know bad packets are being sent from somewhere).

Comment 9 Jan Friesse 2016-01-25 16:08:23 UTC
Chrissie,
do you think we can make this bug into 7.3? If so, please set devel_ack, otherwise just move it to 7.4 (7.3 is mostly about qdevice anyway).

Comment 16 Jan Friesse 2019-06-24 17:05:44 UTC
Created attachment 1584104 [details]
udpu: Drop packets from unlisted IPs

udpu: Drop packets from unlisted IPs

This feature allows corosync to block packets received from unknown
nodes (nodes with an IP address that is not in the nodelist). This is
mainly for situations when a "forgotten" node is booted and tries to join
a cluster which has already removed that node from its configuration.
Another use case is to allow atomic reconfiguration and rejoining of two
separate clusters.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
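
For illustration, the new option lives in the totem section of /etc/corosync/corosync.conf. This is a sketch, not a verbatim config; the cluster name and transport are placeholders, and "yes" enabling the drop is inferred from the option's description and comment 18:

```
totem {
    version: 2
    cluster_name: test
    transport: udpu
    # Drop packets whose source IP is not in the nodelist below.
    block_unlisted_ips: yes
}
```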

Comment 17 Jan Friesse 2019-06-24 17:06:00 UTC
Created attachment 1584105 [details]
man: Enahnce block_unlisted_ips description

man: Enahnce block_unlisted_ips description

Thanks to Christine Caulfield <ccaulfie> for
Englishifying and refining the description.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
(cherry picked from commit d775f1425d6ebbfa25c7ba43c0fc69902507a8d6)

Comment 18 Jan Friesse 2019-07-15 14:32:58 UTC
For QA, the way I've tested the patch.

1. Create a cluster and remove one of the nodes from it without changing the corosync config file of that node. The nodelist on the nodes with the updated config then looks like:

```
    nodelist {
        node {
            nodeid: 1
            name: node1
            ring0_addr: node1_ip
        }
        node {
            nodeid: 2
            name: node2
            ring0_addr: node2_ip
        }

...
```

and on the node with the not-updated config:
```
    nodelist {
        node {
            nodeid: 1
            name: node1
            ring0_addr: node1_ip
        }
        node {
            nodeid: 2
            name: node2
            ring0_addr: node2_ip
        }
        node {
            nodeid: 3
            name: node3
            ring0_addr: node3_ip
        }

...
```

2. Start the cluster. On the nodes with the updated config, the following message is logged (debug has to be turned on):

```
DATE TIME debug   [TOTEM ] Packet rejected from node3_ip
```

and the node without the updated config forms a stable single-node membership (no moving back and forth between the gather/commit/operational states).

3. Set totem.block_unlisted_ips to "no" and retest. The cluster should behave the same way as without the patch, i.e. keep moving between the gather and operational states.
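
When checking step 2, the rejections can be tallied from the debug log. A small sketch (the log format is taken from the excerpts in this bug; `count_rejects` is a hypothetical helper, not part of corosync):

```python
import re

# "[TOTEM ] Packet rejected from <ip>" is the debug message added by the
# block_unlisted_ips patch; count occurrences per sender IP.
REJECT_RE = re.compile(r"\[TOTEM \] Packet rejected from (\S+)")

def count_rejects(lines):
    counts = {}
    for line in lines:
        m = REJECT_RE.search(line)
        if m:
            ip = m.group(1)
            counts[ip] = counts.get(ip, 0) + 1
    return counts

sample = [
    "[24769] node1 corosyncdebug   [TOTEM ] Packet rejected from 10.37.167.7",
    "[24769] node1 corosyncdebug   [TOTEM ] entering GATHER state from 11.",
    "[24769] node1 corosyncdebug   [TOTEM ] Packet rejected from 10.37.167.7",
]
print(count_rejects(sample))  # {'10.37.167.7': 2}
```

A non-empty result on the updated nodes (and none on the removed node) matches the expected post-fix behavior.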

Comment 21 Patrik Hagara 2019-11-28 16:09:09 UTC
reproducer steps used:
  * set a password for hacluster user on all nodes: `passwd hacluster`
  * start pcsd service on all nodes: `systemctl start pcsd`
  * authenticate nodes against each other: `pcs cluster auth node1 node2 node3`
  * setup the cluster (from one node only): `pcs cluster setup --name test node1 node2 node3`
  * on nodes 1 and 2, remove node3 from nodelist in /etc/corosync/corosync.conf
  * set "debug: on" in the logging section of /etc/corosync/corosync.conf on all nodes
  * start corosync on all nodes: `systemctl start corosync`
  * watch /var/log/cluster/corosync.log on all nodes

before fix (corosync-2.4.3-6.el7):
==============

  * node3 keeps trying to form a cluster membership with node1 and node2; the logs show nothing suspicious (no messages repeating every N seconds, it's quiet)
  * nodes 1 and 2 keep trying to integrate node3 into the cluster every time they receive a packet from node3, with the following messages being repeated every few seconds:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Creating commit token because I am the rep.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Saving state aru a high seq received a
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [MAIN  ] Storing new sequence id for ring 13c
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering COMMIT state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] got commit token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering RECOVERY state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] TRANS [0] member 10.37.166.196:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] TRANS [1] member 10.37.166.200:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] position [0] member 10.37.166.196:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] previous ring seq 138 rep 10.37.166.196
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] aru a high delivered a received flag 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] position [1] member 10.37.166.200:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] previous ring seq 138 rep 10.37.166.196
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] aru a high delivered a received flag 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Did not need to originate any messages in recovery.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] got commit token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Sending initial ORF token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Resetting old ring state
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] recovery to regular 1-0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] waiting_trans_ack changed to 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering OPERATIONAL state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [TOTEM ] A new membership (10.37.166.196:316) was formed. Members
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync configuration map access
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CMAP  ] Not first sync -> no action
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.37.166.196) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.37.166.200) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] chosen downlist: sender r(0) ip(10.37.166.196) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] Sending nodelist callback. ring_id = 1/316
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 2 flags: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] total_votes=2, expected_votes=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 1 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 2 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] lowest node id: 1 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] highest node id: 2 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 2 flags: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] total_votes=2, expected_votes=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 1 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 2 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] lowest node id: 1 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] highest node id: 2 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [QUORUM] Members[2]: 1 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [QUORUM] sending quorum notification to (nil), length = 56
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] Sending quorum callback, quorate = 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [MAIN  ] Completed service synchronization, ready to provide service.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] waiting_trans_ack changed to 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering GATHER state from 9(merge during operational state).
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering GATHER state from 0(consensus timeout).


after fix (corosync-2.4.5-4.el7):
=================================

  * node3 forms a stable single-node corosync cluster membership
  * nodes 1 and 2 have their corosync logs spammed with the following message when debug is turned on:
> [24769] virt-122.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Packet rejected from 10.37.167.7
  * re-testing with "block_unlisted_ips: no" inside the totem section of /etc/corosync/corosync.conf reverts to the pre-fix behavior

Marking verified in corosync-2.4.5-4.el7.

Comment 23 errata-xmlrpc 2020-03-31 19:54:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1079