Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 1394209

Summary: Galera nodes get demoted and promoted without apparent reason
Product: Red Hat Enterprise Linux 7
Reporter: Tomas Jamrisko <tjamrisk>
Component: resource-agents
Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 7.3
CC: agk, cluster-maint, fdinitto, mkrcmari, oalbrigt
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-01-15 07:28:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
report from the controller in charge
Second controller
Last controller

Description Tomas Jamrisko 2016-11-11 12:05:33 UTC
Created attachment 1219746 [details]
report from the controller in charge

Description of problem:
Pacemaker demotes and promotes all galera nodes at essentially the same time, without any apparent reason, causing a temporary SQL unavailability (and thus effectively breaking the whole cluster).

How reproducible:
random

Comment 1 Tomas Jamrisko 2016-11-11 12:06:27 UTC
Created attachment 1219747 [details]
Second controller

Comment 2 Tomas Jamrisko 2016-11-11 12:07:16 UTC
Created attachment 1219748 [details]
Last controller

Comment 4 Andrew Beekhof 2016-11-14 00:08:01 UTC
Meta - In the future, please include sos reports. Custom tarballs often omit important information required for debugging.


What makes you say there was no apparent reason?

I see:

Nov 10 19:50:11 [1956] controller-0.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-0: 1 (unknown error) | call=30 key=galera_monitor_10000 confirmed=false cib-update=102
Nov 10 19:50:11 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	controller-0-galera_monitor_10000:30 [ ocf-exit-reason:local node <controller-0> is started, but not in primary mode. Unknown state.\n ]
Nov 10 19:50:11 [1956] controller-0.localdomain       crmd:     info: abort_transition_graph:	Transition aborted by operation galera_monitor_10000 'create' on controller-0: Old event | magic=0:1;35:11:8:bf2b79b6-8726-46d1-b804-fcf20ebe7e0a cib=0.35.8 source=process_graph_event:498 complete=true

which prompts pacemaker to initiate recovery:

Nov 10 19:50:12 [1955] controller-0.localdomain    pengine:     info: LogActions: Leave   galera:0	(Master controller-2) 
Nov 10 19:50:12 [1955] controller-0.localdomain    pengine:     info: LogActions: Leave   galera:1	(Master controller-1)
Nov 10 19:50:12 [1955] controller-0.localdomain    pengine:   notice: LogActions: Recover galera:2	(Master controller-0)

That recovery ultimately fails too:

Nov 10 19:50:16 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	Result of demote operation for galera on controller-0: 0 (ok) | call=78 key=galera_demote_0 confirmed=true cib-update=106
Nov 10 19:50:16 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	Result of stop operation for galera on controller-0: 0 (ok) | call=79 key=galera_stop_0 confirmed=true cib-update=108
Nov 10 19:50:16 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	Result of start operation for galera on controller-0: 0 (ok) | call=80 key=galera_start_0 confirmed=true cib-update=110
Nov 10 19:50:51 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	Result of promote operation for galera on controller-0: 1 (unknown error) | call=81 key=galera_promote_0 confirmed=true cib-update=112
Nov 10 19:50:51 [1956] controller-0.localdomain       crmd:   notice: process_lrm_event:	controller-0-galera_promote_0:81 [ ocf-exit-reason:MySQL server failed to start (pid=28899) (rc=0), please check your installation\n ]

Meanwhile the process repeats on controller-1:

Nov 10 19:50:19 [1827] controller-1.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-1: 1 (unknown error) | call=26 key=galera_monitor_10000 confirmed=false cib-update=44
Nov 10 19:50:19 [1827] controller-1.localdomain       crmd:   notice: process_lrm_event:	controller-1-galera_monitor_10000:26 [ ocf-exit-reason:local node <controller-1> is started, but not in primary mode. Unknown state.\n ]

Nov 10 19:50:56 [1827] controller-1.localdomain       crmd:   notice: process_lrm_event:	Result of demote operation for galera on controller-1: 0 (ok) | call=76 key=galera_demote_0 confirmed=true cib-update=45
Nov 10 19:51:00 [1827] controller-1.localdomain       crmd:   notice: process_lrm_event:	Result of stop operation for galera on controller-1: 0 (ok) | call=77 key=galera_stop_0 confirmed=true cib-update=46
Nov 10 19:51:02 [1827] controller-1.localdomain       crmd:   notice: process_lrm_event:	Result of start operation for galera on controller-1: 0 (ok) | call=78 key=galera_start_0 confirmed=true cib-update=47
Nov 10 19:51:02 [1827] controller-1.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-1: 0 (ok) | call=79 key=galera_monitor_30000 confirmed=false cib-update=48
Nov 10 19:51:02 [1827] controller-1.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-1: 0 (ok) | call=80 key=galera_monitor_20000 confirmed=false cib-update=49

And again on controller-2:

Nov 10 19:50:20 [1800] controller-2.localdomain       crmd:     info: process_lrm_event:	Result of monitor operation for galera on controller-2: 1 (unknown error) | call=28 key=galera_monitor_10000 confirmed=false cib-update=46
Nov 10 19:50:20 [1800] controller-2.localdomain       crmd:   notice: process_lrm_event:	controller-2-galera_monitor_10000:28 [ ocf-exit-reason:local node <controller-2> is started, but not in primary mode. Unknown state.\n ]

Nov 10 19:51:00 [1800] controller-2.localdomain       crmd:   notice: process_lrm_event:	Result of demote operation for galera on controller-2: 0 (ok) | call=77 key=galera_demote_0 confirmed=true cib-update=47
Nov 10 19:51:01 [1800] controller-2.localdomain       crmd:   notice: process_lrm_event:	Result of stop operation for galera on controller-2: 0 (ok) | call=78 key=galera_stop_0 confirmed=true cib-update=48
Nov 10 19:51:01 [1800] controller-2.localdomain       crmd:   notice: process_lrm_event:	Result of start operation for galera on controller-2: 0 (ok) | call=79 key=galera_start_0 confirmed=true cib-update=49

Except we don't try to promote it because the majority of its peers are borked.
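
The demote/stop/start/promote sequences quoted above can be pulled out of the crmd logs with a small filter. A sketch, using sample lines taken from this report (file paths and the exact log layout are assumptions; on a real node the input would be /var/log/messages or the corosync log):

```shell
# Sample crmd lines from this report, standing in for the real log file.
cat > /tmp/crmd-sample.log <<'EOF'
Nov 10 19:50:16 [1956] controller-0.localdomain crmd: notice: process_lrm_event: Result of demote operation for galera on controller-0: 0 (ok) | call=78 key=galera_demote_0 confirmed=true cib-update=106
Nov 10 19:50:16 [1956] controller-0.localdomain crmd: notice: process_lrm_event: Result of stop operation for galera on controller-0: 0 (ok) | call=79 key=galera_stop_0 confirmed=true cib-update=108
Nov 10 19:50:16 [1956] controller-0.localdomain crmd: notice: process_lrm_event: Result of start operation for galera on controller-0: 0 (ok) | call=80 key=galera_start_0 confirmed=true cib-update=110
Nov 10 19:50:51 [1956] controller-0.localdomain crmd: notice: process_lrm_event: Result of promote operation for galera on controller-0: 1 (unknown error) | call=81 key=galera_promote_0 confirmed=true cib-update=112
EOF

# Keep only the operation results, one compact "node op rc" line per event.
grep 'Result of .* operation for galera' /tmp/crmd-sample.log |
    sed -E 's/.*Result of ([a-z]+) operation for galera on ([^:]+): ([0-9]+) \(([^)]+)\).*/\2 \1 rc=\3 (\4)/'
```

With the sample above this prints the controller-0 lifecycle as four lines, ending in `controller-0 promote rc=1 (unknown error)`, which is the failed promote discussed here.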


Passing on to Damien to investigate

Comment 5 Damien Ciabrini 2016-11-16 19:49:47 UTC
Most likely the galera cluster was impacted by a temporary network disruption.

Right after 19:50:00, a network "hiccup" made the galera nodes
unreachable from each other for too long, which made each of them
consider itself to be in a non-Primary partition. This is an error
condition for the galera resource agent.
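
The monitor error in the logs ("local node ... is started, but not in primary mode") corresponds to the agent seeing a wsrep_cluster_status other than "Primary". A minimal sketch of that check (assumed, paraphrased logic, not the actual resource agent code; on a live node the value would come from `mysql -sN -e "SHOW STATUS LIKE 'wsrep_cluster_status'"`):

```shell
# Hypothetical stand-in for the galera agent's monitor-time check.
check_cluster_status() {
    local status="$1"   # value of wsrep_cluster_status on this node
    if [ "$status" = "Primary" ]; then
        echo "OK: node is part of the primary component"
        return 0
    fi
    # Any other value (non-Primary, Disconnected) is the error condition
    # seen in this report's logs.
    echo "ERROR: started, but not in primary mode"
    return 1   # monitor failure -> pacemaker initiates recovery
}

check_cluster_status "Primary"
check_cluster_status "non-Primary" || true
```

A failed monitor like this is what aborts the transition graph and triggers the demote/stop/start/promote cycle quoted in comment 4.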

Now the very unlikely race kicked in: the first monitor operation to
notice this error state happened on controller-0, so pacemaker
triggered a restart of that resource. But at the time it restarted the
server, the other two clone resources had not been restarted yet.

The running galera servers (still in non-Primary state) noticed that
controller-0 had changed identity (i.e. it was not the same process)
since they last saw it before the network issue, and thus refused to
remerge into a single 3-node partition.

Eventually controller-0 timed out in the wsrep join step, which made
the pacemaker resource go unmanaged, as expected from the galera
resource config (op promote on-fail=block).
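
For reference, the behaviour described here comes from a resource definition along these lines (a hedged sketch only; resource names, addresses, and timeouts are assumptions, not taken from this cluster's CIB):

```shell
# Hypothetical galera multi-state resource. The key part for this bug is
# on-fail=block on the promote op: when a promote fails, pacemaker leaves
# the resource unmanaged instead of retrying, so an operator must intervene.
pcs resource create galera ocf:heartbeat:galera \
    wsrep_cluster_address="gcomm://controller-0,controller-1,controller-2" \
    op promote timeout=300s on-fail=block \
    op monitor interval=10s role=Master timeout=30s \
    --master meta master-max=3 ordered=true
```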

When the pending monitor status returned for the other two resources,
pacemaker initiated a restart of those resources, but couldn't determine
a node to bootstrap galera from, because controller-0 was unmanaged.
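
When pacemaker cannot pick a bootstrap node itself, choosing one manually follows the usual Galera procedure: compare the last committed seqno recorded on each node. A sketch with fabricated sample data (on a real controller the file is /var/lib/mysql/grastate.dat; all uuid/seqno values below are made up for illustration):

```shell
# Fabricated grastate.dat contents standing in for the three controllers.
mkdir -p /tmp/grastate
cat > /tmp/grastate/controller-0 <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6a30b2a6-0000-0000-0000-000000000000
seqno:   1520
EOF
cat > /tmp/grastate/controller-1 <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6a30b2a6-0000-0000-0000-000000000000
seqno:   1533
EOF
cat > /tmp/grastate/controller-2 <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6a30b2a6-0000-0000-0000-000000000000
seqno:   1529
EOF

# The node with the highest seqno has the most recent data and should
# bootstrap. A seqno of -1 means an unclean shutdown; that node would need
# "mysqld --wsrep-recover" to find its real position first.
for f in /tmp/grastate/*; do
    printf '%s %s\n' "$(awk '/^seqno:/ {print $2}' "$f")" "$(basename "$f")"
done | sort -rn | head -1
```

With these sample values the pipeline prints `1533 controller-1`, i.e. controller-1 would bootstrap; afterwards, something like `pcs resource cleanup galera` would typically clear the blocked state so pacemaker manages the resource again.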

It is unclear what caused the network disruption at this time. I can see
from the logs that rabbitmq was also impacted, whereas corosync was not.

Comment 7 RHEL Program Management 2021-01-15 07:28:32 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.