Bug 1153818 - Cluster fails to start when nodes that aren't in the cluster think they are and are running corosync
Summary: Cluster fails to start when nodes that aren't in the cluster think they are ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1205796
 
Reported: 2014-10-16 21:20 UTC by Chris Feist
Modified: 2020-03-31 19:54 UTC
CC List: 7 users

Fixed In Version: corosync-2.4.5-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-31 19:54:26 UTC
Target Upstream Version:
Embargoed:


Attachments
udpu: Drop packets from unlisted IPs (7.74 KB, patch), 2019-06-24 17:05 UTC, Jan Friesse
man: Enahnce block_unlisted_ips description (1.24 KB, patch), 2019-06-24 17:06 UTC, Jan Friesse


Links
Red Hat Product Errata RHBA-2020:1079 (last updated 2020-03-31 19:54:37 UTC)

Description Chris Feist 2014-10-16 21:20:36 UTC
Description of problem:
Cluster fails to start when nodes that aren't in the cluster think they are and are running corosync

Version-Release number of selected component (if applicable):
corosync-2.3.3-2.el7.x86_64

How reproducible:
Always (as far as I can tell)

Steps to Reproduce:
1.  Create a cluster of 3 nodes (but don't start it)
2.  Take the corosync.conf from one of the nodes and stick it on another machine
3.  Start corosync on that machine (the one *not* in the cluster)
4.  Try to start corosync on one (or more) of the 3 nodes.

Actual results:
In /var/log/messages

Oct 16 13:22:22 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:24 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:25 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Oct 16 13:22:27 ospha3.cloud.lab.eng.bos.redhat.com corosync[14942]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

Expected results:
The cluster should start properly and ignore the packets from the rogue node (it may log an error message so we know bad packets are being sent from somewhere).

Comment 9 Jan Friesse 2016-01-25 16:08:23 UTC
Chrissie,
do you think we can get this bug into 7.3? If so, please set devel_ack; otherwise just move it to 7.4 (7.3 is mostly about qdevice anyway).

Comment 16 Jan Friesse 2019-06-24 17:05:44 UTC
Created attachment 1584104 [details]
udpu: Drop packets from unlisted IPs

udpu: Drop packets from unlisted IPs

This feature allows corosync to block packets received from unknown
nodes (nodes with an IP address that is not in the nodelist). This is
mainly for situations when a "forgotten" node is booted and tries to join
a cluster which has already removed that node from its configuration.
Another use case is to allow atomic reconfiguration and rejoin of two
separate clusters.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
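
For context, a minimal corosync.conf sketch of where this option would be set, assuming it lives in the totem section (it is referred to as totem.block_unlisted_ips later in this bug); the other totem values here are placeholders, not taken from this bug:

```
    totem {
        version: 2
        cluster_name: test      # placeholder cluster name
        transport: udpu         # this patch targets the udpu transport
        # Reject packets whose source IP is not in the nodelist. Setting this
        # to "no" restores the old behaviour (see the QA notes below).
        block_unlisted_ips: yes
    }
```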

Comment 17 Jan Friesse 2019-06-24 17:06:00 UTC
Created attachment 1584105 [details]
man: Enahnce block_unlisted_ips description

man: Enahnce block_unlisted_ips description

Thanks to Christine Caulfield <ccaulfie> for
Englishifying and refining the description.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
(cherry picked from commit d775f1425d6ebbfa25c7ba43c0fc69902507a8d6)

Comment 18 Jan Friesse 2019-07-15 14:32:58 UTC
For QA, this is how I've tested the patch.

1. Create a cluster and remove one of the nodes from it, without changing that node's corosync config file. The nodelist on the nodes with the updated config then looks like:

```
    nodelist {
        node {
            nodeid: 1
            name: node1
            ring0_addr: node1_ip
        }
        node {
            nodeid: 2
            name: node2
            ring0_addr: node2_ip
        }

...
```

and on the node with the non-updated config:
```
    nodelist {
        node {
            nodeid: 1
            name: node1
            ring0_addr: node1_ip
        }
        node {
            nodeid: 2
            name: node2
            ring0_addr: node2_ip
        }
        node {
            nodeid: 3
            name: node3
            ring0_addr: node3_ip
        }

...
```

2. Start the cluster. On the nodes with the updated config, the following message is logged (debug has to be turned on):

```
DATE TIME debug   [TOTEM ] Packet rejected from node3_ip
```

and the node without the updated config forms a stable single-node membership (no moving back and forth between the gather/commit/operational states).

3. Set totem.block_unlisted_ips to "no" and retest. The cluster should behave the same way as without the patch, i.e. keep moving between the gather/operational states (see the config sketch below).
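
A sketch of the corosync.conf change for step 3 (only the relevant totem option is shown; the rest of the existing totem section stays unchanged):

```
    totem {
        ...
        # Disable the new filtering so packets from unlisted IPs are accepted
        # again, which reproduces the pre-patch behaviour.
        block_unlisted_ips: no
    }
```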

Comment 21 Patrik Hagara 2019-11-28 16:09:09 UTC
reproducer steps used:
  * set a password for hacluster user on all nodes: `passwd hacluster`
  * start pcsd service on all nodes: `systemctl start pcsd`
  * authenticate nodes against each other: `pcs cluster auth node1 node2 node3`
  * setup the cluster (from one node only): `pcs cluster setup --name test node1 node2 node3`
  * on nodes 1 and 2, remove node3 from the nodelist in /etc/corosync/corosync.conf
  * set "debug: on" in the logging section of /etc/corosync/corosync.conf on all nodes (both edits are sketched after this list)
  * start corosync on all nodes: `systemctl start corosync`
  * watch /var/log/cluster/corosync.log on all nodes
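
For reference, a sketch of how the two config edits above end up looking in /etc/corosync/corosync.conf on nodes 1 and 2 (other sections are unchanged; the ring0_addr values and the logging lines other than "debug: on" are placeholders in the style of comment 18):

```
    nodelist {
        node {
            nodeid: 1
            name: node1
            ring0_addr: node1_ip    # placeholder address
        }
        node {
            nodeid: 2
            name: node2
            ring0_addr: node2_ip    # placeholder address
        }
        # the node3 entry is removed here on nodes 1 and 2 only
    }

    logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log    # the file watched in the last step
        debug: on                                 # enable debug messages
    }
```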

before fix (corosync-2.4.3-6.el7):
==============

  * node3 keeps trying to form a cluster membership with node1 and node2, the logs show nothing suspicious (no messages repeating every N seconds, it's quiet)
  * nodes 1 and 2 keep trying to integrate node3 into the cluster every time they receive a packet from node3, with the following messages being repeated every few seconds:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Creating commit token because I am the rep.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Saving state aru a high seq received a
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [MAIN  ] Storing new sequence id for ring 13c
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering COMMIT state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] got commit token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering RECOVERY state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] TRANS [0] member 10.37.166.196:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] TRANS [1] member 10.37.166.200:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] position [0] member 10.37.166.196:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] previous ring seq 138 rep 10.37.166.196
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] aru a high delivered a received flag 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] position [1] member 10.37.166.200:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] previous ring seq 138 rep 10.37.166.196
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] aru a high delivered a received flag 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Did not need to originate any messages in recovery.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] got commit token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Sending initial ORF token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Resetting old ring state
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] recovery to regular 1-0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] waiting_trans_ack changed to 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering OPERATIONAL state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [TOTEM ] A new membership (10.37.166.196:316) was formed. Members
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync configuration map access
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CMAP  ] Not first sync -> no action
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.37.166.196) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] comparing: sender r(0) ip(10.37.166.200) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [CPG   ] chosen downlist: sender r(0) ip(10.37.166.196) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync cluster closed process group service v1.01
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] Sending nodelist callback. ring_id = 1/316
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 2 flags: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] total_votes=2, expected_votes=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 1 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 2 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] lowest node id: 1 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] highest node id: 2 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 2 flags: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] got nodeinfo message from cluster node 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [SYNC  ] Committing synchronization for corosync vote quorum service v1.0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] total_votes=2, expected_votes=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 1 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] node 2 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] lowest node id: 1 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] highest node id: 2 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [QUORUM] Members[2]: 1 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [QUORUM] sending quorum notification to (nil), length = 56
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [VOTEQ ] Sending quorum callback, quorate = 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice  [MAIN  ] Completed service synchronization, ready to provide service.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] waiting_trans_ack changed to 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering GATHER state from 9(merge during operational state).
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] entering GATHER state from 0(consensus timeout).


after fix (corosync-2.4.5-4.el7):
=================================

  * node3 forms a stable single-node corosync cluster membership
  * nodes 1 and 2 have corosync log spammed with the following message when debug is turned on:
> [24769] virt-122.cluster-qe.lab.eng.brq.redhat.com corosyncdebug   [TOTEM ] Packet rejected from 10.37.167.7
  * re-testing with "block_unlisted_ips: no" inside the totem section of /etc/corosync/corosync.conf reverts to the pre-fix behavior

Marking verified in corosync-2.4.5-4.el7.

Comment 23 errata-xmlrpc 2020-03-31 19:54:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1079

