Bug 1891855
| Summary: | galera cannot recover from a network split on a 2-node cluster | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Damien Ciabrini <dciabrin> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.2 | CC: | agk, cluster-maint, fdinitto, pkomarov |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.4 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | resource-agents-4.1.1-72.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-18 15:11:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Damien Ciabrini
2020-10-27 14:19:25 UTC
Steps to verify the fix:

Pre-requisites:
. Deploy a 2-node overcloud with fencing enabled (fencing is mandatory). A standard infrared config scaled down to 2 nodes is ideal.
. The fencing configuration is assumed to restart a node, not stop it (this should be the default configuration).

Once the overcloud is deployed, reconfigure the galera resource to enable the two-node behaviour (a quick configuration check is sketched after this procedure):
pcs resource disable galera-bundle (wait for galera to be stopped)
pcs resource update galera two_node_mode=true
pcs resource enable galera-bundle

Test 1: Planned reboot, remaining node not impacted
. Reboot one node cleanly, and verify that galera stays active on the other node.
. After the node has rebooted, it should rejoin the cluster automatically and galera should be running on both nodes.

Test 2: Node crash, remaining node recovers the service
. Crash nodeA on the hypervisor, e.g.:
ssh 192.168.24.42 "sudo bash -c 'echo b>/proc/sysrq-trigger'"
. nodeB should notice that nodeA is gone; verify that nodeB fences nodeA:
Jul 29 07:33:09 controller-1 pacemaker-fenced[3464]: notice: Requesting peer fencing (off) of controller-0
Jul 29 07:33:09 controller-1 pacemaker-fenced[3464]: notice: stonith-fence_ipmilan-525400aaa3dc is eligible to fence (off) controller-0: static-list
Jul 29 07:33:13 controller-1 pacemaker-fenced[3464]: notice: Operation 'off' [1025403] (call 16 from pacemaker-controld.3472) for host 'controller-0' with device 'stonith-fence_ipmilan-525400aaa3dc' returned: 0 (OK)
Jul 29 07:33:13 controller-1 pacemaker-fenced[3464]: notice: Call to stonith-fence_ipmilan-525400aaa3dc for 'controller-0 off' on behalf of pacemaker-controld.3472@controller-1: OK (0)
. Verify that nodeB keeps quorum:
Jul 29 07:33:06 controller-1 corosync[3263]: [KNET ] link: host: 1 link: 0 is down
Jul 29 07:33:08 controller-1 corosync[3263]: [QUORUM] Members[1]: 2
Jul 29 07:33:08 controller-1 corosync[3263]: [MAIN ] Completed service synchronization, ready to provide service.
. Galera has its own quorum, so this temporarily causes it to stop. But once fencing is confirmed, pacemaker can restart galera on the surviving node. Verify that galera is restarted:
Jul 29 07:33:19 controller-1 galera(galera)[1026778]: INFO: Galera instance wsrep_cluster_status=non-Primary
Jul 29 07:33:28 controller-1 galera(galera)[1027942]: INFO: MySQL stopped
Jul 29 07:33:36 controller-1 galera(galera)[1028666]: WARNING: Survived a split in a 2-node cluster, considering ourselves safe to bootstrap
Jul 29 07:33:36 controller-1 galera(galera)[1028673]: INFO: Node <controller-1> is marked as safe to bootstrap.
Jul 29 07:33:37 controller-1 galera(galera)[1028686]: INFO: Promoting controller-1 to be our bootstrap node
Jul 29 07:33:51 controller-1 galera(galera)[1030889]: INFO: Galera started

Test 3: Restart galera when a single node is running
. Stop pacemaker on nodeA. Verify that galera is still running on nodeB.
. Restart the galera resource on nodeB. Since pacemaker still has quorum on nodeB, galera will restart successfully.

Test 4: Protection from split-brain: make pacemaker lose quorum on the remaining node
. Stop nodeA with 'pcs cluster stop'.
. Once nodeA is stopped, stop nodeB with 'pcs cluster stop --force'.
. Restart nodeB with 'pcs cluster start'. At this point there is no quorum in pacemaker because nodeA is still down, so pacemaker will refuse to restart galera. Verify that quorum is lost:
[root@standalone-1 ~]# corosync-quorumtool | grep Quorate
Quorate: No
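To confirm the prerequisite before running the tests, the new option can be checked on either controller. This is only a sketch; the resource name (galera) and bundle name (galera-bundle) assume the default overcloud deployment:
pcs resource config galera | grep two_node_mode   # should report two_node_mode=true
pcs status | grep galera-bundle                   # both bundle replicas should be listed as Master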
Verified,
thanks Damien for the help on this :)
[stack@undercloud-0 ~]$ ansible controller -b -mshell -a 'rpm -q pacemaker'
controller-1 | CHANGED | rc=0 >>
pacemaker-2.0.5-3.el8.x86_64
controller-0 | CHANGED | rc=0 >>
pacemaker-2.0.5-3.el8.x86_64
[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'podman exec `podman ps -f name=galera-bundle -q` sh -c "rpm -q resource-agents";rpm -q resource-agents'
controller-1 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64
controller-0 | CHANGED | rc=0 >>
resource-agents-4.1.1-79.el8.x86_64
resource-agents-4.1.1-79.el8.x86_64
[stack@undercloud-0 ~]$ ansible controller-0 -b -mshell -a'echo b>/proc/sysrq-trigger'
Dec 06 21:51:17 controller-1 pacemaker-schedulerd[515182]: warning: Cluster node controller-0 will be fenced: peer is no longer part of the cluster
Dec 06 21:51:17 controller-1 pacemaker-schedulerd[515182]: warning: Node controller-0 is unclean
Dec 06 21:51:17 controller-1 pacemaker-controld[515183]: notice: Requesting fencing (reboot) of node controller-0
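The fencing action can also be confirmed after the fact from pacemaker's fencing history, for example (a sketch; run on the surviving node):
pcs stonith history show controller-0
stonith_admin --history controller-0   # equivalent lower-level query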
#The galera resource running on controller-1 saw its peer disappear as well, so it decided to shut off temporarily. From the journal:
Dec 06 21:51:23 controller-1 galera(galera)[313712]: INFO: Galera instance wsrep_cluster_status=non-Primary
Dec 06 21:51:23 controller-1 galera(galera)[313717]: ERROR: local node <controller-1> is started, but not in primary mode. Unknown state.
Dec 06 21:51:35 controller-1 galera(galera)[315905]: INFO: MySQL stopped
(this stop logs an error in pcs status; it is not fatal, it just indicates that the server stopped unexpectedly)
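If desired, that stale failure can be cleared from pcs status once the cluster has settled, for example (a sketch; the bundle name assumes the default deployment):
pcs resource cleanup galera-bundle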
And once the fencing was confirmed via the IPMI interface:
Dec 06 21:51:25 controller-1 pacemaker-fenced[515179]: notice: Operation 'reboot' targeting controller-0 on controller-1 for pacemaker-controld.515183: OK
Pacemaker could then restart the galera resource on controller-1, because at that point it is guaranteed to have kept quorum.
Dec 06 21:51:39 controller-1 pacemaker-controld[515183]: notice: Initiating start operation galera_start_0 locally on galera-bundle-1
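The quorum guarantee that pacemaker relies on here can also be checked directly on controller-1 (a sketch):
crm_node -q                          # prints 1 when the local partition has quorum
corosync-quorumtool | grep Quorate   # should report: Quorate: Yes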
And on restart, rather than staying in "Slave" mode (not started) until controller-0 comes back online, this galera node detects that it's running on the partition that still has quorum:
Dec 06 21:51:45 controller-1 galera(galera)[316840]: WARNING: Survived a split in a 2-node cluster, considering ourselves safe to bootstrap
Dec 06 21:51:45 controller-1 galera(galera)[316844]: INFO: Node <controller-1> is marked as safe to bootstrap.
Dec 06 21:51:45 controller-1 galera(galera)[316851]: INFO: Promoting controller-1 to be our bootstrap node
So the galera resource decides that it is safe to start galera on controller-1:
Dec 06 21:52:00 controller-1 galera(galera)[319082]: INFO: Bootstrap complete, promoting the rest of the galera instances.
Dec 06 21:52:00 controller-1 galera(galera)[319086]: INFO: Galera started
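Once the bootstrap completes, the Galera-level state can be confirmed from inside the container. This is only a sketch; the container lookup and passwordless local root access assume a default galera-bundle deployment:
podman exec $(podman ps -f name=galera-bundle -q) mysql -e "SHOW STATUS LIKE 'wsrep_cluster_%';"
# expected on the bootstrapped node: wsrep_cluster_status=Primary, with wsrep_cluster_size back to 2 once controller-0 rejoins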
[..]
* galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
* galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1736