1410474 – HA | Galera fails to start after losing one of the controllers.

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	mariadb-galera
Sub Component:
Version:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Damien Ciabrini
QA Contact:	Udi Shkalim
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-01-05 15:24 UTC by Rodrigo A B Freire
Modified:	2020-02-14 18:31 UTC
CC List:	24 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-01-05 16:58:49 UTC
Target Upstream Version:
Embargoed:

Description Rodrigo A B Freire 2017-01-05 15:24:27 UTC

Description of problem:
galera-master fails to start, and cannot be manually promoted to master, if the former master node went offline.

Version-Release number of selected component (if applicable):
RHOSP7, resource-agents-3.9.5-82.el7_3.3

How reproducible:
Uncertain

Steps to Reproduce:
1. Have a 3-node HA environment running
2. Kill (shut down abruptly) the node that is the Galera master

Actual results:
Cannot promote to master or start galera-master on the other nodes.

Expected results:
Should be able to start the other Galera nodes or promote them to master.

Additional info:

Comment 1 Michael Bayer 2017-01-05 15:43:31 UTC

In a Galera cluster, there is no single node that is the "galera master"; all nodes are masters. When one controller is powered off, the remaining nodes just continue running normally, so it's not clear what was actually observed.

We will at least require SOS reports from all three controller nodes in order to begin diagnosing what was seen.

Comment 2 Damien Ciabrini 2017-01-05 16:58:21 UTC

The resource agent could not restart the stopped galera cluster automatically because one of the controller nodes was offline.

According to the logs from the customer ticket, a mistake was made while following the procedure to bootstrap the cluster manually: the pcs command to restart the first galera node was not executed on the node selected for bootstrap. The resource agent prevented the restart accordingly.

After replaying the manual restart procedure, the cluster went up as expected.
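
For illustration only, here is a minimal sketch of the check that was skipped: confirming that the local node is the one selected for bootstrap before restarting it. This ticket does not state how the bootstrap node was selected. The sketch assumes the usual Galera convention of picking the node with the highest seqno recorded in its grastate.dat. The node names and helper functions are hypothetical, and this is not the resource agent's actual logic.

```python
import re


def parse_seqno(grastate_text):
    """Return the seqno found in grastate.dat-style text, or -1 if absent."""
    match = re.search(r"^seqno:\s*(-?\d+)\s*$", grastate_text, re.MULTILINE)
    return int(match.group(1)) if match else -1


def pick_bootstrap_node(states):
    """Given {node_name: grastate_text}, return the node with the highest seqno.

    Returns None when no node reports a known (non-negative) seqno.
    """
    seqnos = {node: parse_seqno(text) for node, text in states.items()}
    best = max(seqnos, key=seqnos.get, default=None)
    if best is None or seqnos[best] < 0:
        return None
    return best


def may_bootstrap_here(local_node, states):
    """True only if the local node is the one selected for bootstrap."""
    return pick_bootstrap_node(states) == local_node


# Hypothetical state collected from the controllers (names are assumptions).
states = {
    "ctrl-0": "uuid: abc\nseqno: 120\n",
    "ctrl-1": "uuid: abc\nseqno:   135\n",
    "ctrl-2": "uuid: abc\nseqno: -1\n",
}
```

With the sample data above, only "ctrl-1" passes the check, so the manual restart would be run there and nowhere else.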

Comment 3 Rodrigo A B Freire 2017-01-16 12:26:46 UTC

Removing needinfo?, as C#2 was pretty elucidative.

abeekhof
agk
bschmaus
cluster-maint
dciabrin
dmaley
ebarrera
fdinitto
jherrman
jruemker
jschluet
kgaillot
lnatapov
mbayer
mburns
mcornea
mschuppe
oblaut
pablo.iranzo
rfreire
royoung
sknauss
srevivo
ushkalim