Bug 1153818
| Field | Value |
|---|---|
| Summary | Cluster fails to start when nodes that aren't in the cluster think they are and are running corosync |
| Product | Red Hat Enterprise Linux 7 |
| Component | corosync |
| Version | 7.2 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Target Milestone | rc |
| Reporter | Chris Feist <cfeist> |
| Assignee | Jan Friesse <jfriesse> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | ccaulfie, cfeist, cluster-maint, jfriesse, jkortus, mnovacek, phagara |
| Fixed In Version | corosync-2.4.5-1.el7 |
| Type | Bug |
| Last Closed | 2020-03-31 19:54:26 UTC |
| Bug Blocks | 1205796 |
Description Chris Feist 2014-10-16 21:20:36 UTC
Chrissie, do you think we can get this bug into 7.3? If so, please set devel_ack; otherwise just move it to 7.4 (7.3 is mostly about qdevice anyway).

Created attachment 1584104 [details]
udpu: Drop packets from unlisted IPs
This feature allows corosync to block packets received from unknown
nodes (nodes whose IP address is not in the nodelist). This is mainly
for situations where a "forgotten" node is booted and tries to join a
cluster that has already removed that node from its configuration.
Another use case is to allow atomic reconfiguration and rejoin of two
separate clusters.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
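For orientation, the option this patch introduces is configured in the totem section of /etc/corosync/corosync.conf; the QA notes below refer to it as totem.block_unlisted_ips. Here is a minimal sketch of a two-node UDPU configuration with the option set explicitly; the cluster name, node names, and addresses are placeholders, not values taken from this bug:

```
# Minimal sketch of /etc/corosync/corosync.conf (placeholder values).
totem {
    version: 2
    cluster_name: test        # placeholder
    transport: udpu
    # Option added by this patch: drop packets received from IP
    # addresses that are not present in the nodelist below. Per the
    # QA notes, setting it to "no" restores the old behavior of
    # accepting packets from unlisted IPs.
    block_unlisted_ips: yes
}

nodelist {
    node {
        nodeid: 1
        name: node1
        ring0_addr: node1_ip  # placeholder address
    }
    node {
        nodeid: 2
        name: node2
        ring0_addr: node2_ip  # placeholder address
    }
}
```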
Created attachment 1584105 [details]
man: Enhance block_unlisted_ips description
Thanks to Christine Caulfield <ccaulfie> for anglicizing and refining
the description.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
(cherry picked from commit d775f1425d6ebbfa25c7ba43c0fc69902507a8d6)
For QA, this is the way I've tested the patch.

1. Create a cluster and remove one of the nodes from the cluster without changing the corosync config file of that node. The nodelist on nodes with the updated config looks like:

```
nodelist {
    node {
        nodeid: 1
        name: node1
        ring0_addr: node1_ip
    }

    node {
        nodeid: 2
        name: node2
        ring0_addr: node2_ip
    }
    ...
```

and on the node with the not-yet-updated config:

```
nodelist {
    node {
        nodeid: 1
        name: node1
        ring0_addr: node1_ip
    }

    node {
        nodeid: 2
        name: node2
        ring0_addr: node2_ip
    }

    node {
        nodeid: 3
        name: node3
        ring0_addr: node3_ip
    }
    ...
```

2. Start the cluster. On the nodes with the updated config, the following message is logged (debug has to be turned on):

```
DATE TIME debug [TOTEM ] Packet rejected from node3_ip
```

and the node without the updated config forms a stable single-node membership (no moving back and forth between the gather/commit/operational states).

3. Set totem.block_unlisted_ips to "no" and retest. The cluster should behave the same way as without the patch, i.e. keep moving between the gather/operational states. (A config sketch for this retest follows the reproducer list below.)

reproducer steps used:

* set a password for the hacluster user on all nodes: `passwd hacluster`
* start the pcsd service on all nodes: `systemctl start pcsd`
* authenticate the nodes against each other: `pcs cluster auth node1 node2 node3`
* set up the cluster (from one node only): `pcs cluster setup --name test node1 node2 node3`
* on nodes 1 and 2, remove node3 from the nodelist in /etc/corosync/corosync.conf
* set "debug: on" in the logging section of /etc/corosync/corosync.conf on all nodes
* start corosync on all nodes: `systemctl start corosync`
* watch /var/log/cluster/corosync.log on all nodes
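For the retest in step 3, this is roughly what the relevant fragments of /etc/corosync/corosync.conf would look like. The to_logfile/logfile keys are assumptions added only to make the sketch self-contained; the steps above literally mention only "debug: on" and the log path being watched:

```
# totem section: disable the new blocking behavior for the retest.
totem {
    version: 2
    cluster_name: test
    transport: udpu
    block_unlisted_ips: no    # revert to pre-patch behavior
}

# logging section: debug has to be on for the rejection messages to show.
logging {
    to_logfile: yes                          # assumed
    logfile: /var/log/cluster/corosync.log   # path watched in the steps above
    debug: on
}
```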
before fix (corosync-2.4.3-6.el7):
==================================

* node3 keeps trying to form a cluster membership with node1 and node2; its logs show nothing suspicious (no messages repeating every N seconds, it's quiet)
* nodes 1 and 2 keep trying to integrate node3 into the cluster every time they receive a packet from node3, with the following messages being repeated every few seconds:

> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] Creating commit token because I am the rep.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] Saving state aru a high seq received a
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [MAIN ] Storing new sequence id for ring 13c
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] entering COMMIT state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] got commit token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] entering RECOVERY state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] TRANS [0] member 10.37.166.196:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] TRANS [1] member 10.37.166.200:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] position [0] member 10.37.166.196:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] previous ring seq 138 rep 10.37.166.196
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] aru a high delivered a received flag 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] position [1] member 10.37.166.200:
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] previous ring seq 138 rep 10.37.166.196
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] aru a high delivered a received flag 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] Did not need to originate any messages in recovery.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] got commit token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] Sending initial ORF token
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] install seq 0 aru 0 high seq received 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] Resetting old ring state
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] recovery to regular 1-0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] waiting_trans_ack changed to 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] entering OPERATIONAL state.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice [TOTEM ] A new membership (10.37.166.196:316) was formed. Members
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [SYNC ] Committing synchronization for corosync configuration map access
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [CMAP ] Not first sync -> no action
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [CPG ] comparing: sender r(0) ip(10.37.166.196) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [CPG ] comparing: sender r(0) ip(10.37.166.200) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [CPG ] chosen downlist: sender r(0) ip(10.37.166.196) ; members(old:2 left:0)
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] Sending nodelist callback. ring_id = 1/316
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] got nodeinfo message from cluster node 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 2 flags: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] total_votes=2, expected_votes=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] node 1 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] node 2 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] lowest node id: 1 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] highest node id: 2 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] got nodeinfo message from cluster node 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] got nodeinfo message from cluster node 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 2 flags: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] got nodeinfo message from cluster node 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [SYNC ] Committing synchronization for corosync vote quorum service v1.0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] total_votes=2, expected_votes=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] node 1 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] node 2 state=1, votes=1, expected=2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] lowest node id: 1 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] highest node id: 2 us: 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice [QUORUM] Members[2]: 1 2
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [QUORUM] sending quorum notification to (nil), length = 56
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [VOTEQ ] Sending quorum callback, quorate = 1
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncnotice [MAIN ] Completed service synchronization, ready to provide service.
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] waiting_trans_ack changed to 0
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] entering GATHER state from 9(merge during operational state).
> [24805] virt-069.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] entering GATHER state from 0(consensus timeout).

after fix (corosync-2.4.5-4.el7):
=================================

* node3 forms a stable single-node corosync cluster membership
* nodes 1 and 2 have the corosync log spammed with the following message when debug is turned on:

> [24769] virt-122.cluster-qe.lab.eng.brq.redhat.com corosyncdebug [TOTEM ] Packet rejected from 10.37.167.7

* re-testing with "block_unlisted_ips: no" inside the totem section of /etc/corosync/corosync.conf reverts to the pre-fix behavior

Marking verified in corosync-2.4.5-4.el7.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1079