Bug 1394209 - Galera nodes get demoted and promoted without apparent reason
Summary: Galera nodes get demoted and promoted without apparent reason
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Damien Ciabrini
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-11 12:05 UTC by Tomas Jamrisko
Modified: 2021-01-15 07:28 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-15 07:28:32 UTC
Target Upstream Version:


Attachments
report from the controller in charge (7.66 MB, application/x-gzip), 2016-11-11 12:05 UTC, Tomas Jamrisko
Second controller (7.21 MB, application/x-gzip), 2016-11-11 12:06 UTC, Tomas Jamrisko
Last controller (7.08 MB, application/x-gzip), 2016-11-11 12:07 UTC, Tomas Jamrisko

Description Tomas Jamrisko 2016-11-11 12:05:33 UTC
Created attachment 1219746 [details]
report from the controller in charge

Description of problem:
Pacemaker demotes and promotes all galera nodes at basically the same time, without an apparent reason, causing a temporary SQL unavailability (and thus pretty much breaking the whole cluster).

How reproducible:
random

Comment 1 Tomas Jamrisko 2016-11-11 12:06:27 UTC
Created attachment 1219747 [details]
Second controller

Comment 2 Tomas Jamrisko 2016-11-11 12:07:16 UTC
Created attachment 1219748 [details]
Last controller

Comment 4 Andrew Beekhof 2016-11-14 00:08:01 UTC
Meta - In future, please include sos reports.  Custom tarballs often omit important information required for debugging.


What makes you say there was no apparent reason?

I see:

Nov 10 19:50:11 [1956] controller-0.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-0: 1 (unknown error) | call=30 key=galera_monitor_10000 confirmed=false cib-update=102
Nov 10 19:50:11 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	controller-0-galera_monitor_10000:30 [ ocf-exit-reason:local node <controller-0> is started, but not in primary mode. Unknown state.\n ]
Nov 10 19:50:11 [1956] controller-0.localdomain       crmd:     info: abort_transition_graph:	Transition aborted by operation galera_monitor_10000 'create' on controller-0: Old event | magic=0:1;35:11:8:bf2b79b6-8726-46d1-b804-fcf20ebe7e0a cib=0.35.8 source=process_graph_event:498 complete=true

which prompts pacemaker to initiate recovery:

Nov 10 19:50:12 [1955] controller-0.localdomain    pengine:     info: LogActions: Leave   galera:0	(Master controller-2) 
Nov 10 19:50:12 [1955] controller-0.localdomain    pengine:     info: LogActions: Leave   galera:1	(Master controller-1)
Nov 10 19:50:12 [1955] controller-0.localdomain    pengine:   notice: LogActions: Recover galera:2	(Master controller-0)

That recovery ultimately fails too:

Nov 10 19:50:16 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	Result of demote operation for galera on controller-0: 0 (ok) | call=78 key=galera_demote_0 confirmed=true cib-update=106
Nov 10 19:50:16 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	Result of stop operation for galera on controller-0: 0 (ok) | call=79 key=galera_stop_0 confirmed=true cib-update=108
Nov 10 19:50:16 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	Result of start operation for galera on controller-0: 0 (ok) | call=80 key=galera_start_0 confirmed=true cib-update=110
Nov 10 19:50:51 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	Result of promote operation for galera on controller-0: 1 (unknown error) | call=81 key=galera_promote_0 confirmed=true cib-update=112
Nov 10 19:50:51 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	controller-0-galera_promote_0:81 [ ocf-exit-reason:MySQL server failed to start (pid=28899) (rc=0), please check your installation\n ]

Meanwhile the process repeats on controller-1:

Nov 10 19:50:19 [1827] controller-1.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-1: 1 (unknown error) | call=26 key=galera_monitor_10000 confirmed=false cib-update=44
Nov 10 19:50:19 [1827] controller-1.localdomain       crmd:   notice: process_lrm_event:	controller-1-galera_monitor_10000:26 [ ocf-exit-reason:local node <controller-1> is started, but not in primary mode. Unknown state.\n ]

Nov 10 19:50:56 [1827] controller-1.localdomain       crmd:   notice: process_lrm_event:	Result of demote operation for galera on controller-1: 0 (ok) | call=76 key=galera_demote_0 confirmed=true cib-update=45
Nov 10 19:51:00 [1827] controller-1.localdomain       crmd:   notice: process_lrm_event:	Result of stop operation for galera on controller-1: 0 (ok) | call=77 key=galera_stop_0 confirmed=true cib-update=46
Nov 10 19:51:02 [1827] controller-1.localdomain       crmd:   notice: process_lrm_event:	Result of start operation for galera on controller-1: 0 (ok) | call=78 key=galera_start_0 confirmed=true cib-update=47
Nov 10 19:51:02 [1827] controller-1.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-1: 0 (ok) | call=79 key=galera_monitor_30000 confirmed=false cib-update=48
Nov 10 19:51:02 [1827] controller-1.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-1: 0 (ok) | call=80 key=galera_monitor_20000 confirmed=false cib-update=49

And again on controller-2:

Nov 10 19:50:20 [1800] controller-2.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-2: 1 (unknown error) | call=28 key=galera_monitor_10000 confirmed=false cib-update=46
Nov 10 19:50:20 [1800] controller-2.localdomain       crmd:   notice: process_lrm_event:	controller-2-galera_monitor_10000:28 [ ocf-exit-reason:local node <controller-2> is started, but not in primary mode. Unknown state.\n ]

Nov 10 19:51:00 [1800] controller-2.localdomain       crmd:   notice: process_lrm_event:	Result of demote operation for galera on controller-2: 0 (ok) | call=77 key=galera_demote_0 confirmed=true cib-update=47
Nov 10 19:51:01 [1800] controller-2.localdomain       crmd:   notice: process_lrm_event:	Result of stop operation for galera on controller-2: 0 (ok) | call=78 key=galera_stop_0 confirmed=true cib-update=48
Nov 10 19:51:01 [1800] controller-2.localdomain       crmd:   notice: process_lrm_event:	Result of start operation for galera on controller-2: 0 (ok) | call=79 key=galera_start_0 confirmed=true cib-update=49

Except we don't try to promote it because the majority of its peers are borked.


Passing on to Damien to investigate

Comment 5 Damien Ciabrini 2016-11-16 19:49:47 UTC
Most likely the galera cluster was impacted by a temporary network disruption.

Right after 19:50:00, a network "hiccup" made the galera nodes
unreachable from each other for too long, which made them consider
themselves to be in a non-Primary partition. This is an error
condition for the galera resource agent.
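
For reference, the error condition here corresponds to the agent
finding wsrep_cluster_status reported as anything other than
"Primary". A minimal Python sketch of that check (illustrative only,
with hypothetical helper names; the real agent is a shell script)
which maps a non-Primary status to the OCF generic error seen in the
monitor results above:

import subprocess

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1   # surfaces in pacemaker as "unknown error"

def wsrep_cluster_status():
    # Ask the local server for Galera's view of the cluster component.
    out = subprocess.check_output(
        ["mysql", "-N", "-B", "-e",
         "SHOW STATUS LIKE 'wsrep_cluster_status';"],
        text=True)
    # Batch output is "wsrep_cluster_status<TAB>Primary"; keep the value.
    return out.split()[-1]

def monitor(node):
    if wsrep_cluster_status() != "Primary":
        # Same situation as "local node <controller-0> is started, but
        # not in primary mode. Unknown state." in the logs above.
        print("ocf-exit-reason:local node <%s> is started, "
              "but not in primary mode. Unknown state." % node)
        return OCF_ERR_GENERIC
    return OCF_SUCCESS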

Then a very unlikely race kicked in: the first monitor operation to
notice this error state ran on controller-0, so pacemaker triggered
a restart of that resource. But by the time the server had been
restarted there, the other two clone instances had not been
restarted yet.

The still-running galera servers (still in non-Primary state)
noticed that controller-0 had changed identity (i.e. it was no
longer the same process) since they last saw it before the network
issue, and thus they refused to remerge into a single 3-node
partition.
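
To illustrate the effect (a toy model of the behaviour described
above, not Galera's actual group membership protocol): each running
node remembers the incarnation of the peers it last saw, and a peer
that comes back as a different process no longer matches.

import uuid

class Node:
    def __init__(self, name):
        self.name = name
        self.incarnation = uuid.uuid4()  # changes on every process restart
        self.last_seen = {}              # peer name -> incarnation last seen

    def observe(self, peers):
        for p in peers:
            self.last_seen[p.name] = p.incarnation

    def restart(self):
        self.incarnation = uuid.uuid4()

    def accepts_remerge_with(self, peer):
        # A still-running (non-Primary) node refuses to remerge with a
        # peer whose identity changed since it was last seen.
        return self.last_seen.get(peer.name) == peer.incarnation

a, b, c = Node("controller-0"), Node("controller-1"), Node("controller-2")
for n in (a, b, c):
    n.observe([a, b, c])

a.restart()                       # pacemaker restarts controller-0 first
print(b.accepts_remerge_with(a))  # False: remerge refused
print(c.accepts_remerge_with(a))  # False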

Eventually controller-0 timed out in the wsrep join step, which
made the pacemaker resource go unmanaged, as expected from the
galera resource configuration (op promote on-fail=block).

When the pending monitor results came back for the other two
resources, pacemaker initiated a restart of those as well, but it
could not determine a node to bootstrap galera from, because
controller-0 was unmanaged.
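
Roughly speaking, a new galera cluster can only be bootstrapped
safely from the node with the most recent committed state, so the
choice has to wait until every node has reported, and an unmanaged
node never does. A minimal Python sketch of that constraint
(illustrative only, with a hypothetical function name, not the
agent's actual shell implementation):

from typing import Dict, Optional

def pick_bootstrap_node(last_committed: Dict[str, Optional[int]]) -> Optional[str]:
    # Refuse to choose while any node's last committed seqno is unknown:
    # bootstrapping from a node that is behind could lose transactions.
    if any(seqno is None for seqno in last_committed.values()):
        return None
    return max(last_committed, key=last_committed.get)

# controller-0 is unmanaged, so it never reports a seqno -> no bootstrap node.
print(pick_bootstrap_node({"controller-0": None,
                           "controller-1": 4321,
                           "controller-2": 4321}))   # -> None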

It is unclear what caused the network disruption at this time. I
can see from the logs that rabbitmq was also impacted, whereas
corosync was not.

Comment 7 RHEL Program Management 2021-01-15 07:28:32 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

