| Summary: | Galera nodes get demoted and promoted without apparent reason | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Tomas Jamrisko <tjamrisk> |
| Component: | resource-agents | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.3 | CC: | agk, cluster-maint, fdinitto, mkrcmari, oalbrigt |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-15 07:28:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Created attachment 1219747 [details]
Second controller

Created attachment 1219748 [details]
Last controller
Meta - In future, please include sos reports. Custom tarballs often omit important information required for debugging.

What makes you say there was no apparent reason? I see:

Nov 10 19:50:11 [1956] controller-0.localdomain crmd: info: process_lrm_event: Result of monitor operation for galera on controller-0: 1 (unknown error) | call=30 key=galera_monitor_10000 confirmed=false cib-update=102
Nov 10 19:50:11 [1956] controller-0.localdomain crmd: notice: process_lrm_event: controller-0-galera_monitor_10000:30 [ ocf-exit-reason:local node <controller-0> is started, but not in primary mode. Unknown state.\n ]
Nov 10 19:50:11 [1956] controller-0.localdomain crmd: info: abort_transition_graph: Transition aborted by operation galera_monitor_10000 'create' on controller-0: Old event | magic=0:1;35:11:8:bf2b79b6-8726-46d1-b804-fcf20ebe7e0a cib=0.35.8 source=process_graph_event:498 complete=true

which prompts pacemaker to initiate recovery:

Nov 10 19:50:12 [1955] controller-0.localdomain pengine: info: LogActions: Leave galera:0 (Master controller-2)
Nov 10 19:50:12 [1955] controller-0.localdomain pengine: info: LogActions: Leave galera:1 (Master controller-1)
Nov 10 19:50:12 [1955] controller-0.localdomain pengine: notice: LogActions: Recover galera:2 (Master controller-0)

That recovery ultimately fails too:

Nov 10 19:50:16 [1956] controller-0.localdomain crmd: notice: process_lrm_event: Result of demote operation for galera on controller-0: 0 (ok) | call=78 key=galera_demote_0 confirmed=true cib-update=106
Nov 10 19:50:16 [1956] controller-0.localdomain crmd: notice: process_lrm_event: Result of stop operation for galera on controller-0: 0 (ok) | call=79 key=galera_stop_0 confirmed=true cib-update=108
Nov 10 19:50:16 [1956] controller-0.localdomain crmd: notice: process_lrm_event: Result of start operation for galera on controller-0: 0 (ok) | call=80 key=galera_start_0 confirmed=true cib-update=110
Nov 10 19:50:51 [1956] controller-0.localdomain crmd: notice: process_lrm_event: Result of promote operation for galera on controller-0: 1 (unknown error) | call=81 key=galera_promote_0 confirmed=true cib-update=112
Nov 10 19:50:51 [1956] controller-0.localdomain crmd: notice: process_lrm_event: controller-0-galera_promote_0:81 [ ocf-exit-reason:MySQL server failed to start (pid=28899) (rc=0), please check your installation\n ]

Meanwhile the process repeats on controller-1:

Nov 10 19:50:19 [1827] controller-1.localdomain crmd: info: process_lrm_event: Result of monitor operation for galera on controller-1: 1 (unknown error) | call=26 key=galera_monitor_10000 confirmed=false cib-update=44
Nov 10 19:50:19 [1827] controller-1.localdomain crmd: notice: process_lrm_event: controller-1-galera_monitor_10000:26 [ ocf-exit-reason:local node <controller-1> is started, but not in primary mode. Unknown state.\n ]
Nov 10 19:50:56 [1827] controller-1.localdomain crmd: notice: process_lrm_event: Result of demote operation for galera on controller-1: 0 (ok) | call=76 key=galera_demote_0 confirmed=true cib-update=45
Nov 10 19:51:00 [1827] controller-1.localdomain crmd: notice: process_lrm_event: Result of stop operation for galera on controller-1: 0 (ok) | call=77 key=galera_stop_0 confirmed=true cib-update=46
Nov 10 19:51:02 [1827] controller-1.localdomain crmd: notice: process_lrm_event: Result of start operation for galera on controller-1: 0 (ok) | call=78 key=galera_start_0 confirmed=true cib-update=47
Nov 10 19:51:02 [1827] controller-1.localdomain crmd: info: process_lrm_event: Result of monitor operation for galera on controller-1: 0 (ok) | call=79 key=galera_monitor_30000 confirmed=false cib-update=48
Nov 10 19:51:02 [1827] controller-1.localdomain crmd: info: process_lrm_event: Result of monitor operation for galera on controller-1: 0 (ok) | call=80 key=galera_monitor_20000 confirmed=false cib-update=49

And again on controller-2:

Nov 10 19:50:20 [1800] controller-2.localdomain crmd: info: process_lrm_event: Result of monitor operation for galera on controller-2: 1 (unknown error) | call=28 key=galera_monitor_10000 confirmed=false cib-update=46
Nov 10 19:50:20 [1800] controller-2.localdomain crmd: notice: process_lrm_event: controller-2-galera_monitor_10000:28 [ ocf-exit-reason:local node <controller-2> is started, but not in primary mode. Unknown state.\n ]
Nov 10 19:51:00 [1800] controller-2.localdomain crmd: notice: process_lrm_event: Result of demote operation for galera on controller-2: 0 (ok) | call=77 key=galera_demote_0 confirmed=true cib-update=47
Nov 10 19:51:01 [1800] controller-2.localdomain crmd: notice: process_lrm_event: Result of stop operation for galera on controller-2: 0 (ok) | call=78 key=galera_stop_0 confirmed=true cib-update=48
Nov 10 19:51:01 [1800] controller-2.localdomain crmd: notice: process_lrm_event: Result of start operation for galera on controller-2: 0 (ok) | call=79 key=galera_start_0 confirmed=true cib-update=49

Except we don't try to promote it because the majority of its peers are borked.

Passing on to Damien to investigate.
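For anyone triaging a similar "started, but not in primary mode" monitor failure, it helps to compare what Galera itself thinks of the cluster with what pacemaker recorded. A minimal diagnostic sketch, run on each controller, assuming the local mysql client can authenticate (e.g. via /root/.my.cnf) and that the clustered resource is named galera as in these logs:

```
# Galera's own view of membership: a value of "non-Primary" here is the
# condition the galera resource agent turns into a monitor failure.
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
mysql -e "SHOW STATUS LIKE 'wsrep_incoming_addresses';"

# Pacemaker's view of the same event: failed operations and the current
# role of each galera clone instance.
pcs status --full
crm_mon -1 --show-detail
```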
Most likely the galera cluster was impacted by a temporary network disruption.

Right after 19:50:00, a network "hiccup" made the galera nodes unreachable from each other for too long, which made them consider themselves part of a non-Primary partition. This is an error condition for the galera resource agent.

Now the very unlikely race kicked in: the first monitor operation to notice this error state happened on controller-0, so pacemaker triggered a restart of this resource. But at the time it restarted the server, the other two clone instances had not been restarted yet. Those running galera servers (still in non-Primary state) noticed that controller-0 had changed identity (i.e. it was no longer the same process) since they last saw it before the network issue, and thus they refused to re-merge into a single 3-node partition. Eventually controller-0 timed out in the wsrep join step, which made the pacemaker resource go unmanaged, as expected from the galera resource config (op promote on-fail=block).

When the pending monitor status returned for the other two resources, pacemaker initiated a restart of those resources, but couldn't determine a node to bootstrap galera from, because controller-0 was unmanaged.

It is unclear what caused the network disruption at this time.
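For context on the "op promote on-fail=block" behaviour mentioned above: with that setting, a failed promote does not trigger further automatic recovery; pacemaker stops managing that clone instance on the affected node until an operator intervenes. A hedged sketch of what such a galera master resource definition typically looks like with pcs on RHEL 7; the node names and option values below are illustrative and may not match this exact deployment:

```
pcs resource create galera galera \
    enable_creation=true \
    wsrep_cluster_address="gcomm://controller-0,controller-1,controller-2" \
    meta master-max=3 ordered=true \
    op promote timeout=300s on-fail=block \
    --master
```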
I can see from the logs that rabbitmq was also impacted, whereas corosync was not.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
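When the cluster ends up in the state described above (instances stopped or blocked and no bootstrap node selectable), recovery is a manual operation. A rough sketch of the usual steps, assuming MariaDB's default datadir and a resource named galera; exact commands and paths may differ per deployment:

```
# On each controller: the node with the highest seqno in grastate.dat
# has the most recent write-set and is the safest bootstrap candidate.
cat /var/lib/mysql/grastate.dat

# If seqno is -1 (unclean shutdown), the position can usually be
# recovered from the InnoDB logs (illustrative invocation):
sudo -u mysql mysqld_safe --wsrep-recover

# Clear the failed promote so pacemaker manages the galera instances
# again and re-runs its own bootstrap-node selection.
pcs resource cleanup galera
```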
Created attachment 1219746 [details]
report from the controller in charge

Description of problem:
Pacemaker demotes and promotes all galera nodes at essentially the same time, without an apparent reason, causing a temporary SQL unavailability (and thus pretty much breaking the whole cluster).

How reproducible:
random