Bug 1891855

Summary: galera cannot recover from a network split on a 2-node cluster
Product: Red Hat Enterprise Linux 8
Reporter: Damien Ciabrini <dciabrin>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: high
Priority: unspecified
Version: 8.2
CC: agk, cluster-maint, fdinitto, pkomarov
Target Milestone: rc
Target Release: 8.4
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: resource-agents-4.1.1-72.el8
Last Closed: 2021-05-18 15:11:15 UTC
Type: Bug

Description Damien Ciabrini 2020-10-27 14:19:25 UTC
Description of problem:
Galera maintains its own quorum, and when a network split
occurs in a two-node cluster, both nodes become inquorate.

The resource agent always demotes a node when it loses
galera quorum; however it cannot promote it back because
it waits for the other node to advertise its DB sequence
number in the CIB, and that information is unavailable
during the network split.

So the resource agent is currently unable to make any
automatic decision to restart the resource, even after
pacemaker has fenced the other node and determined that it
is the surviving node in the cluster.
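For reference, the sequence number each node advertises lives in the CIB as a transient node attribute. A minimal way to inspect it, assuming the resource is named galera (recent agents store it as <resource>-last-committed):

crm_attribute -N controller-0 -l reboot --name galera-last-committed --query -q

If the peer never got to advertise its sequence number before the split, this query returns nothing, which is why promotion stalls.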
 
Version-Release number of selected component (if applicable):
resource-agents-4.1.1-44.el8_2.3.x86_64

How reproducible:
Always

Steps to Reproduce:
1. deploy galera in a two-node pacemaker cluster
2. force a network disruption between the two nodes
3. observe that both galera nodes shut down

Actual results:
neither side of the cluster can restart galera, even after the other side has been fenced

Expected results:
automatic recovery should take place

Additional info:

Comment 5 Damien Ciabrini 2020-12-04 08:42:35 UTC
Steps to verify the fix:

Pre-requisite:
. Deploy a 2-node overcloud with fencing enabled (fencing is mandatory). A standard infrared config scaled down to 2-node is ideal.
. Fencing is assumed to be configured to restart a node, not stop it (this should be the default configuration).
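
Before running the tests, it may help to sanity-check the fencing setup with standard pcs commands:

pcs property show stonith-enabled   # should report: stonith-enabled: true
pcs stonith status                  # the fence devices should be Started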

Once the overcloud is deployed, reconfigure the galera resource to enable the two-node behaviour:

pcs resource disable galera-bundle (wait for galera to be stopped)
pcs resource update galera two_node_mode=true
pcs resource enable galera-bundle
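
To confirm the option took effect, check the resource definition (standard pcs usage):

pcs resource config galera | grep two_node_mode   # expect: two_node_mode=true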



Test 1: Planned reboot, remaining node not impacted

Reboot one node cleanly, and verify that galera stays active on the other node.
After the rebooted node comes back up, it should rejoin the cluster automatically and galera should be running on both nodes.
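
A quick way to verify from either node (output format may vary slightly across pcs versions):

pcs status | grep -A3 galera-bundle   # expect both galera-bundle replicas in Master state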


Test 2: Node crash, remaining node recovers service

. Crash nodeA on the hypervisor. e.g.:
ssh 192.168.24.42 "sudo bash -c 'echo b>/proc/sysrq-trigger'"

. nodeB should notice that nodeA is gone; verify that nodeB fences nodeA:

Jul 29 07:33:09 controller-1 pacemaker-fenced[3464]:  notice: Requesting peer fencing (off) of controller-0
Jul 29 07:33:09 controller-1 pacemaker-fenced[3464]:  notice: stonith-fence_ipmilan-525400aaa3dc is eligible to fence (off) controller-0: static-list
Jul 29 07:33:13 controller-1 pacemaker-fenced[3464]:  notice: Operation 'off' [1025403] (call 16 from pacemaker-controld.3472) for host 'controller-0' with device 'stonith-fence_ipmilan-525400aaa3dc' returned: 0 (OK)
Jul 29 07:33:13 controller-1 pacemaker-fenced[3464]:  notice: Call to stonith-fence_ipmilan-525400aaa3dc for 'controller-0 off' on behalf of pacemaker-controld.3472@controller-1: OK (0)
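
The fencing action can also be confirmed afterwards with pcs (assuming a pcs version that provides the stonith history subcommand):

pcs stonith history show controller-0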

. Verify that nodeB keeps quorum
Jul 29 07:33:06 controller-1 corosync[3263]:   [KNET  ] link: host: 1 link: 0 is down
Jul 29 07:33:08 controller-1 corosync[3263]:   [QUORUM] Members[1]: 2
Jul 29 07:33:08 controller-1 corosync[3263]:   [MAIN  ] Completed service synchronization, ready to provide service.
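
The same check can be done interactively, mirroring the command used in Test 4 below:

corosync-quorumtool | grep Quorate   # expect: Quorate: Yes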

. Galera has its own quorum, so the split temporarily causes it to stop. But once fencing is confirmed, pacemaker can restart galera on the surviving node. Verify that galera is restarted:

Jul 29 07:33:19 controller-1 galera(galera)[1026778]: INFO: Galera instance wsrep_cluster_status=non-Primary
Jul 29 07:33:28 controller-1 galera(galera)[1027942]: INFO: MySQL stopped
Jul 29 07:33:36 controller-1 galera(galera)[1028666]: WARNING: Survived a split in a 2-node cluster, considering ourselves safe to bootstrap
Jul 29 07:33:36 controller-1 galera(galera)[1028673]: INFO: Node <controller-1> is marked as safe to bootstrap.
Jul 29 07:33:37 controller-1 galera(galera)[1028686]: INFO: Promoting controller-1 to be our bootstrap node
Jul 29 07:33:51 controller-1 galera(galera)[1030889]: INFO: Galera started
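
Galera's own view can be double-checked from inside the bundle, assuming root access to mysql over the local socket as in a standard overcloud:

podman exec $(podman ps -qf name=galera-bundle) mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
# expect: wsrep_cluster_status  Primary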


Test 3: Restart galera when a single node is running

. Stop pacemaker on nodeA. Verify that galera is still running on nodeB.

. Restart the galera resource on nodeB. Since pacemaker still has quorum on nodeB, galera will restart successfully.
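
For example, with standard pcs commands (the resource name galera-bundle matches the deployment above):

pcs cluster stop                     # on nodeA
pcs resource restart galera-bundle   # on nodeB; galera should come back as Master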


Test 4: Protection from split-brain: make pacemaker lose quorum on the remaining node

. Stop nodeA with 'pcs cluster stop'

. once nodeA is stopped, stop nodeB with 'pcs cluster stop --force'

. restart nodeB with 'pcs cluster start'.

At this point there is no quorum in pacemaker because nodeA is still down, so pacemaker will refuse to restart galera. Verify that quorum is lost:
[root@standalone-1 ~]# corosync-quorumtool  | grep Quorate
Quorate:          No
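
To recover from this state, restore quorum by starting the cluster on nodeA again; pacemaker will then start galera on its own:

pcs cluster start   # run on nodeA (or: pcs cluster start nodeA from nodeB)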

Comment 6 pkomarov 2020-12-06 22:25:22 UTC
Verified.
Thanks Damien for the help on this :)

[stack@undercloud-0 ~]$ ansible controller -b -mshell -a 'rpm -q pacemaker'
controller-1 | CHANGED | rc=0 >>
pacemaker-2.0.5-3.el8.x86_64
controller-0 | CHANGED | rc=0 >>
pacemaker-2.0.5-3.el8.x86_64

[stack@undercloud-0 ~]$  ansible controller -b -mshell -a'podman exec `podman ps -f name=galera-bundle -q`  sh -c "rpm -q resource-agents";rpm -q resource-agents'

controller-1 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64
controller-0 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64


[stack@undercloud-0 ~]$ ansible controller-0 -b -mshell -a'echo b>/proc/sysrq-trigger'


Dec 06 21:51:17 controller-1 pacemaker-schedulerd[515182]:  warning: Cluster node controller-0 will be fenced: peer is no longer part of the cluster
Dec 06 21:51:17 controller-1 pacemaker-schedulerd[515182]:  warning: Node controller-0 is unclean
Dec 06 21:51:17 controller-1 pacemaker-controld[515183]:  notice: Requesting fencing (reboot) of node controller-0

# The galera resource running on controller-1 saw its peer disappear as well, so it decided to shut down temporarily. From the journal:
Dec 06 21:51:23 controller-1 galera(galera)[313712]: INFO: Galera instance wsrep_cluster_status=non-Primary
Dec 06 21:51:23 controller-1 galera(galera)[313717]: ERROR: local node <controller-1> is started, but not in primary mode. Unknown state.
Dec 06 21:51:35 controller-1 galera(galera)[315905]: INFO: MySQL stopped

(This stop logs an error in pcs status. It's not fatal though; it's just a log that says the server stopped unexpectedly.)

And once the fencing was confirmed via the IPMI device:
Dec 06 21:51:25 controller-1 pacemaker-fenced[515179]:  notice: Operation 'reboot' targeting controller-0 on controller-1 for pacemaker-controld.515183: OK

Pacemaker could restart the galera resource on controller-1, because at this point pacemaker is guaranteed to have kept quorum.

Dec 06 21:51:39 controller-1 pacemaker-controld[515183]:  notice: Initiating start operation galera_start_0 locally on galera-bundle-1

And on restart, rather than staying in "Slave" mode (not started) until controller-0 comes back online, this galera node detects that it's running on the partition that still has quorum:

Dec 06 21:51:45 controller-1 galera(galera)[316840]: WARNING: Survived a split in a 2-node cluster, considering ourselves safe to bootstrap
Dec 06 21:51:45 controller-1 galera(galera)[316844]: INFO: Node <controller-1> is marked as safe to bootstrap.
Dec 06 21:51:45 controller-1 galera(galera)[316851]: INFO: Promoting controller-1 to be our bootstrap node

So the galera resource decides that it's safe to start galera on controller-1:

Dec 06 21:52:00 controller-1 galera(galera)[319082]: INFO: Bootstrap complete, promoting the rest of the galera instances.
Dec 06 21:52:00 controller-1 galera(galera)[319086]: INFO: Galera started

[..]
    * galera-bundle-0   (ocf::heartbeat:galera):         Master controller-0
    * galera-bundle-1   (ocf::heartbeat:galera):         Master controller-1
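
For reference, the "safe to bootstrap" decision above is persisted in Galera's state file, which can be inspected on the surviving node (default path shown; inside the bundle it may differ):

cat /var/lib/mysql/grastate.dat   # expect: safe_to_bootstrap: 1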

Comment 8 errata-xmlrpc 2021-05-18 15:11:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1736