Bug 1153818

Summary: Cluster fails to start when nodes that aren't in the cluster think they are and are running corosync
Product: Red Hat Enterprise Linux 7
Reporter: Chris Feist <cfeist>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: high
Version: 7.2
CC: ccaulfie, cfeist, cluster-maint, jfriesse, jkortus, mnovacek, phagara
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: corosync-2.4.5-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-31 19:54:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1205796
Attachments:
  * udpu: Drop packets from unlisted IPs (flags: none)
  * man: Enahnce block_unlisted_ips description (flags: none)

Description Chris Feist 2014-10-16 21:20:36 UTC
Description of problem:
Cluster fails to start when nodes that aren't in the cluster think they are and are running corosync

Version-Release number of selected component (if applicable):
corosync-2.3.3-2.el7.x86_64

How reproducible:
Always (as far as I can tell)

Steps to Reproduce:
1.  Create a cluster of 3 nodes (but don't start it)
2.  Take the corosync.conf from one of the nodes and copy it to another machine
3.  Start corosync on that machine (the one *not* in the cluster)
4.  Try to start corosync on one (or more) of the 3 nodes.

Actual results:
In /var/log/messages

Oct 16 13:22:22 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:24 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:25 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:27 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

Expected results:
Cluster should start properly and ignore the stray packets (it may log an error message so we know bad packets are being sent from somewhere).

Comment 9 Jan Friesse 2016-01-25 16:08:23 UTC
Chrissie,
do you think we can make this bug into 7.3? If so, please set devel_ack, otherwise just move it to 7.4 (7.3 is mostly about qdevice anyway).

Comment 16 Jan Friesse 2019-06-24 17:05:44 UTC
Created attachment 1584104 [details]
udpu: Drop packets from unlisted IPs

udpu: Drop packets from unlisted IPs

This feature allows corosync to block packets received from unknown
nodes (nodes with an IP address that is not in the nodelist). This is
mainly for situations when a "forgotten" node is booted and tries to join
a cluster which has already removed that node from its configuration.
Another use case is to allow atomic reconfiguration and rejoining of two
separate clusters.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
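
For illustration, the new option lives in the totem section of /etc/corosync/corosync.conf. This is a sketch, not a verbatim config; the cluster name and transport are placeholders, and "yes" enabling the drop is inferred from the option's description and comment 18:

```
totem {
    version: 2
    cluster_name: test
    transport: udpu
    # Drop packets whose source IP is not in the nodelist below.
    block_unlisted_ips: yes
}
```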

Comment 17 Jan Friesse 2019-06-24 17:06:00 UTC
Created attachment 1584105 [details]
man: Enahnce block_unlisted_ips description

man: Enahnce block_unlisted_ips description

Thanks to Christine Caulfield <ccaulfie> for
Englishifying and refining the description.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
(cherry picked from commit d775f1425d6ebbfa25c7ba43c0fc69902507a8d6)

Comment 18 Jan Friesse 2019-07-15 14:32:58 UTC
For QA, the way I've tested the patch.

1. Create a cluster and remove one of the nodes from it without changing the corosync config file of that node. The nodelist on the nodes with the updated config then looks like:

```
    nodelist {
        node {
            nodeid: 1
            name: node1
            ring0_addr: node1_ip
        }
        node {
            nodeid: 2
            name: node2
            ring0_addr: node2_ip
        }

...
```

and on the node with the not-updated config:
```
    nodelist {
        node {
            nodeid: 1
            name: node1
            ring0_addr: node1_ip
        }
        node {
            nodeid: 2
            name: node2
            ring0_addr: node2_ip
        }
        node {
            nodeid: 3
            name: node3
            ring0_addr: node3_ip
        }

...
```

2. Start the cluster. On the nodes with the updated config, the following message is logged (debug has to be turned on):

```
DATE TIME debug   [TOTEM ] Packet rejected from node3_ip
```

and the node without the updated config forms a stable single-node membership (no moving back and forth between the gather/commit/operational states).

3. Set totem.block_unlisted_ips to "no" and retest. The cluster should behave the same way as without the patch, i.e. keep moving between the gather and operational states.
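
When checking step 2, the rejections can be tallied from the debug log. A small sketch (the log format is taken from the excerpts in this bug; `count_rejects` is a hypothetical helper, not part of corosync):

```python
import re

# "[TOTEM ] Packet rejected from <ip>" is the debug message added by the
# block_unlisted_ips patch; count occurrences per sender IP.
REJECT_RE = re.compile(r"\[TOTEM \] Packet rejected from (\S+)")

def count_rejects(lines):
    counts = {}
    for line in lines:
        m = REJECT_RE.search(line)
        if m:
            ip = m.group(1)
            counts[ip] = counts.get(ip, 0) + 1
    return counts

sample = [
    "[24769] node1 corosyncdebug   [TOTEM ] Packet rejected from 10.37.167.7",
    "[24769] node1 corosyncdebug   [TOTEM ] entering GATHER state from 11.",
    "[24769] node1 corosyncdebug   [TOTEM ] Packet rejected from 10.37.167.7",
]
print(count_rejects(sample))  # {'10.37.167.7': 2}
```

A non-empty result on the updated nodes (and none on the removed node) matches the expected post-fix behavior.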

Comment 21 Patrik Hagara 2019-11-28 16:09:09 UTC
reproducer steps used:
  * set a password for hacluster user on all nodes: `passwd hacluster`
  * start pcsd service on all nodes: `systemctl start pcsd`
  * authenticate nodes against each other: `pcs cluster auth node1 node2 node3`
  * setup the cluster (from one node only): `pcs cluster setup --name test node1 node2 node3`
  * on nodes 1 and 2, remove node3 from nodelist in /etc/corosync/corosync.conf
  * set "debug: on" in the logging section of /etc/corosync/corosync.conf on all nodes
  * start corosync on all nodes: `systemctl start corosync`
  * watch /var/log/cluster/corosync.log on all nodes

before fix (corosync-2.4.3-6.el7):
==============

  * node3 keeps trying to form a cluster membership with node1 and node2; the logs show nothing suspicious (no messages repeating every N seconds, it's quiet)
  * nodes 1 and 2 keep trying to integrate node3 into the cluster every time they receive a packet from node3, with the following messages being repeated every few seconds:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Creating commit token because I am the rep.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Saving state aru a high seq received a
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [MAIN  ] Storing new sequence id for ring 13c
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering COMMIT state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] got commit token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering RECOVERY state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] TRANS [0] member 10.37.166.196:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] TRANS [1] member 10.37.166.200:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] position [0] member 10.37.166.196:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] previous ring seq 138 rep 10.37.166.196
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] aru a high delivered a received flag 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] position [1] member 10.37.166.200:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] previous ring seq 138 rep 10.37.166.196
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] aru a high delivered a received flag 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Did not need to originate any messages in recovery.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] got commit token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Sending initial ORF token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Resetting old ring state
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] recovery to regular 1-0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] waiting_trans_ack changed to 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering OPERATIONAL state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [TOTEM ] A new membership (10.37.166.196:316) was formed. Members
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync configuration map access
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CMAP  ] Not first sync -> no action
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.37.166.196) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.37.166.200) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] chosen downlist: sender r(0) ip(10.37.166.196) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] Sending nodelist callback. ring_id = 1/316
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 2 flags: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] total_votes=2, expected_votes=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 1 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 2 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] lowest node id: 1 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] highest node id: 2 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 2 flags: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] total_votes=2, expected_votes=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 1 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 2 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] lowest node id: 1 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] highest node id: 2 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [QUORUM] Members[2]: 1 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [QUORUM] sending quorum notification to (nil), length = 56
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] Sending quorum callback, quorate = 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [MAIN  ] Completed service synchronization, ready to provide service.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] waiting_trans_ack changed to 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering GATHER state from 9(merge during operational state).
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering GATHER state from 0(consensus timeout).


after fix (corosync-2.4.5-4.el7):
=================================

  * node3 forms a stable single-node corosync cluster membership
  * nodes 1 and 2 have their corosync logs spammed with the following message when debug is turned on:
> [24769] virt-122.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Packet rejected from 10.37.167.7
  * re-testing with "block_unlisted_ips: no" inside the totem section of /etc/corosync/corosync.conf reverts to the pre-fix behavior

Marking verified in corosync-2.4.5-4.el7.

Comment 23 errata-xmlrpc 2020-03-31 19:54:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1079