Bug 1891855 - galera cannot recover from a network split on a 2-node cluster
Summary: galera cannot recover from a network split on a 2-node cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: resource-agents
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-27 14:19 UTC by Damien Ciabrini
Modified: 2021-05-18 15:11 UTC
CC: 4 users

Fixed In Version: resource-agents-4.1.1-72.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-18 15:11:15 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments: none


Links
GitHub: ClusterLabs/resource-agents pull 1569 (status: closed), "galera: recover after network split in a 2-node cluster", last updated 2020-12-11 12:40:40 UTC

Description Damien Ciabrini 2020-10-27 14:19:25 UTC
Description of problem:
Galera maintains its own quorum, and when a network split
occurs in a two-node cluster, both nodes become inquorate.

The resource agent always demotes a node when it loses
galera quorum; however it cannot promote it back because
it waits for the other node to advertise its DB sequence
number in the CIB, and that information is unavailable
during the network split.

So the resource agent is currently unable to take any
automatic decision to restart the resource, even after
pacemaker has fenced the other node and determined that it is
the surviving node in the cluster.
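
For context, the fix tracked here (see the linked pull request) makes the agent trust pacemaker's quorum in that situation when the new two_node_mode option is enabled. A minimal sketch of the idea, assuming a shell-based OCF agent; the function name is illustrative, not the shipped code:

detect_safe_to_bootstrap_two_node()
{
    # crm_node -q prints 1 when the local pacemaker partition has quorum.
    # In a 2-node cluster, still holding quorum after a split implies the
    # peer has been fenced, so bootstrapping locally cannot cause split-brain.
    if [ "$(crm_node -q)" = "1" ]; then
        ocf_log warn "Survived a split in a 2-node cluster, considering ourselves safe to bootstrap"
        return 0
    fi
    return 1
}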
 
Version-Release number of selected component (if applicable):
resource-agents-4.1.1-44.el8_2.3.x86_64

How reproducible:
Always

Steps to Reproduce:
1. deploy a galera cluster on a two-node pacemaker cluster
2. force a network disruption between the two nodes (one illustrative way is sketched below)
3. watch both galera nodes shut down
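
One way to force the disruption from one of the nodes; the peer address is a placeholder, substitute the real one:

# Drop all traffic to and from the peer node (run on either node).
iptables -A INPUT -s <peer-ip> -j DROP
iptables -A OUTPUT -d <peer-ip> -j DROP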

Actual results:
no side of the cluster can restart galera even after the other side has been fenced

Expected results:
automatic recovery should take place

Additional info:

Comment 5 Damien Ciabrini 2020-12-04 08:42:35 UTC
Steps to verify the fix:

Pre-requisite:
. Deploy a 2-node overcloud with fencing enabled (fencing is mandatory). A standard infrared config scaled down to 2 nodes is ideal.
. Fencing is assumed to be configured to restart a node, not stop it (this should be the default configuration).

Once the overcloud is deployed, reconfigure the galera resource to enable the two-node behaviour:

pcs resource disable galera-bundle (wait for galera to be stopped)
pcs resource update galera two_node_mode=true
pcs resource enable galera-bundle
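
If needed, one way to confirm the option took effect (output format may vary by pcs version):

pcs resource config galera | grep two_node_mode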



Test 1: Planned reboot, remaining node not impacted

Reboot one node cleanly, and verify that galera stays active on the other node.
After the node has rebooted, it should rejoin the cluster automatically and galera should be running on both nodes.
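
A quick way to check this on the surviving node, using the resource names from this deployment:

# Run during the reboot and again after the node rejoins; both replicas
# should eventually report Master.
pcs status | grep -A3 galera-bundle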


Test 2: Node crash, remaining node recovers service

. Crash nodeA on the hypervisor. e.g.:
ssh 192.168.24.42 "sudo bash -c 'echo b>/proc/sysrq-trigger'"

. nodeB should notice that nodeA is gone; verify that nodeB fences nodeA:

Jul 29 07:33:09 controller-1 pacemaker-fenced[3464]:  notice: Requesting peer fencing (off) of controller-0
Jul 29 07:33:09 controller-1 pacemaker-fenced[3464]:  notice: stonith-fence_ipmilan-525400aaa3dc is eligible to fence (off) controller-0: static-list
Jul 29 07:33:13 controller-1 pacemaker-fenced[3464]:  notice: Operation 'off' [1025403] (call 16 from pacemaker-controld.3472) for host 'controller-0' with device 'stonith-fence_ipmilan-525400aaa3dc' returned: 0 (OK)
Jul 29 07:33:13 controller-1 pacemaker-fenced[3464]:  notice: Call to stonith-fence_ipmilan-525400aaa3dc for 'controller-0 off' on behalf of pacemaker-controld.3472@controller-1: OK (0)

. Verify that nodeB keeps quorum
Jul 29 07:33:06 controller-1 corosync[3263]:   [KNET  ] link: host: 1 link: 0 is down
Jul 29 07:33:08 controller-1 corosync[3263]:   [QUORUM] Members[1]: 2
Jul 29 07:33:08 controller-1 corosync[3263]:   [MAIN  ] Completed service synchronization, ready to provide service.

. Galera has its own quorum, so the split temporarily causes it to stop. But once fencing is confirmed, pacemaker can restart galera on the surviving node. Verify that galera is restarted:

Jul 29 07:33:19 controller-1 galera(galera)[1026778]: INFO: Galera instance wsrep_cluster_status=non-Primary
Jul 29 07:33:28 controller-1 galera(galera)[1027942]: INFO: MySQL stopped
Jul 29 07:33:36 controller-1 galera(galera)[1028666]: WARNING: Survived a split in a 2-node cluster, considering ourselves safe to bootstrap
Jul 29 07:33:36 controller-1 galera(galera)[1028673]: INFO: Node <controller-1> is marked as safe to bootstrap.
Jul 29 07:33:37 controller-1 galera(galera)[1028686]: INFO: Promoting controller-1 to be our bootstrap node
Jul 29 07:33:51 controller-1 galera(galera)[1030889]: INFO: Galera started


Test 3: Restart galera when a single node is running

. Stop pacemaker on nodeA. Verify that galera is still running on the nodeB

. Restart the galera resource on nodeB. Since pacemaker still has quorum on nodeB, galera will restart successfully (illustrative commands below).
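
For reference, the test could be driven like this; commands assume the resource names used above:

pcs cluster stop                     # on nodeA: stop pacemaker there
pcs resource restart galera-bundle   # on nodeB: restart the resource
pcs status | grep galera-bundle      # on nodeB: should report Master again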


Test 4: Protection from split-brain: make pacemaker lose quorum on the remaining node

. Stop nodeA with 'pcs cluster stop'

. once nodeA is stopped, stop nodeB with 'pcs cluster stop --force'

. restart nodeB with 'pcs cluster start'.

At this point there is no quorum in pacemaker because nodeA is still down, so pacemaker will refuse to restart galera. Verify that quorum is lost:
[root@standalone-1 ~]# corosync-quorumtool  | grep Quorate
Quorate:          No
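
To wrap up the test, starting nodeA again should restore quorum and let pacemaker recover galera; for instance:

pcs cluster start                    # on nodeA
corosync-quorumtool | grep Quorate   # expect: Quorate: Yes once nodeA is back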

Comment 6 pkomarov 2020-12-06 22:25:22 UTC
Verified.
Thanks Damien for the help on this :)

[stack@undercloud-0 ~]$ ansible controller -b -mshell -a 'rpm -q pacemaker'
controller-1 | CHANGED | rc=0 >>
pacemaker-2.0.5-3.el8.x86_64
controller-0 | CHANGED | rc=0 >>
pacemaker-2.0.5-3.el8.x86_64

[stack@undercloud-0 ~]$  ansible controller -b -mshell -a'podman exec `podman ps -f name=galera-bundle -q`  sh -c "rpm -q resource-agents";rpm -q resource-agents'

controller-1 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64
controller-0 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64


[stack@undercloud-0 ~]$ ansible controller-0 -b -mshell -a'echo b>/proc/sysrq-trigger'


Dec 06 21:51:17 controller-1 pacemaker-schedulerd[515182]:  warning: Cluster node controller-0 will be fenced: peer is no longer part of the cluster
Dec 06 21:51:17 controller-1 pacemaker-schedulerd[515182]:  warning: Node controller-0 is unclean
Dec 06 21:51:17 controller-1 pacemaker-controld[515183]:  notice: Requesting fencing (reboot) of node controller-0

# The galera resource running on controller-1 saw its peer disappear as well, so it decided to shut down temporarily. From the journal:
Dec 06 21:51:23 controller-1 galera(galera)[313712]: INFO: Galera instance wsrep_cluster_status=non-Primary
Dec 06 21:51:23 controller-1 galera(galera)[313717]: ERROR: local node <controller-1> is started, but not in primary mode. Unknown state.
Dec 06 21:51:35 controller-1 galera(galera)[315905]: INFO: MySQL stopped

(This stop logs an error in pcs status. It is not fatal though; it is just a message noting that the server stopped unexpectedly.)

And once the fencing got confirmed by the IPMI port:
Dec 06 21:51:25 controller-1 pacemaker-fenced[515179]:  notice: Operation 'reboot' targeting controller-0 on controller-1 for pacemaker-controld.515183: OK

Pacemaker could restart the galera resource on controller-1, because at this point pacemaker has the guarantee that it has kept quorum.

Dec 06 21:51:39 controller-1 pacemaker-controld[515183]:  notice: Initiating start operation galera_start_0 locally on galera-bundle-1

And on restart, rather than staying in "Slave" mode (not started) until controller-0 comes back online, this galera node detects that it's running on the partition that still has quorum:

Dec 06 21:51:45 controller-1 galera(galera)[316840]: WARNING: Survived a split in a 2-node cluster, considering ourselves safe to bootstrap
Dec 06 21:51:45 controller-1 galera(galera)[316844]: INFO: Node <controller-1> is marked as safe to bootstrap.
Dec 06 21:51:45 controller-1 galera(galera)[316851]: INFO: Promoting controller-1 to be our bootstrap node

So the galera resource decides that it is safe to start galera on controller-1:

Dec 06 21:52:00 controller-1 galera(galera)[319082]: INFO: Bootstrap complete, promoting the rest of the galera instances.
Dec 06 21:52:00 controller-1 galera(galera)[319086]: INFO: Galera started

[..]
    * galera-bundle-0   (ocf::heartbeat:galera):         Master controller-0
    * galera-bundle-1   (ocf::heartbeat:galera):         Master controller-1

Comment 8 errata-xmlrpc 2021-05-18 15:11:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1736

