Bug 1410474 - HA | Galera fails to start after losing one of the controllers.
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: mariadb-galera
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Damien Ciabrini
QA Contact: Udi Shkalim
Depends On:
Reported: 2017-01-05 15:24 UTC by Rodrigo A B Freire
Modified: 2020-02-14 18:31 UTC

Fixed In Version:
Doc Text:
Clone Of:
Last Closed: 2017-01-05 16:58:49 UTC
Target Upstream Version:

Description Rodrigo A B Freire 2017-01-05 15:24:27 UTC
Description of problem:
galera-master fails to start, and cannot be manually promoted to master, after the former master node goes offline.

Version-Release number of selected component (if applicable):
RHOSP7, resource-agents-3.9.5-82.el7_3.3

How reproducible:

Steps to Reproduce:
1. Have a 3-node HA environment running
2. Kill (shutdown abruptly) the node that is the Galera master

Actual results:
Cannot promote to master or start galera-master on the other nodes.

Expected results:
It should be possible to start the other Galera nodes or promote them to master.

Additional info:

Comment 1 Michael Bayer 2017-01-05 15:43:31 UTC
In a Galera cluster, there is no single node that is the "galera master"; all nodes are masters. When one controller is powered off, the remaining nodes simply continue running normally, so it is not clear what was actually observed.
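
The expectation that two surviving controllers keep running can be illustrated with a simple majority check. This is only a sketch: it assumes every node carries equal weight and that quorum is counted against the size of the cluster before the failure. The function name `has_quorum` is made up for the example and is not part of Galera or Pacemaker.

```python
def has_quorum(reachable_nodes: int, previous_cluster_size: int) -> bool:
    """Return True if the reachable nodes form a strict majority.

    Assumption: all nodes have the same weight, so quorum is a plain
    majority of the membership seen before the failure.
    """
    return 2 * reachable_nodes > previous_cluster_size


# Three controllers, one powered off: two of three remain.
print(has_quorum(2, 3))  # True
# Only one controller left out of three: no majority.
print(has_quorum(1, 3))  # False
```

Under these assumptions, losing a single controller out of three leaves a majority, which is why an outage after one node failure suggests that something else also went wrong.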

We will at least require SOS reports from all three controller nodes in order to begin diagnosing what was seen.

Comment 2 Damien Ciabrini 2017-01-05 16:58:21 UTC
The resource agent could not restart the stopped Galera cluster automatically because one of the controller nodes was offline.

According to the logs from the customer ticket, a mistake was made while following the procedure to bootstrap the cluster manually: the pcs command to restart the first Galera node was not executed on the node selected for bootstrap. The resource agent prevented the restart accordingly.

After the manual restart procedure was repeated, the cluster came up as expected.
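
Choosing the node for a manual bootstrap can be sketched as follows. This is not the exact procedure from the customer ticket, which is not reproduced in this bug. It assumes the usual Galera convention that the node with the highest `seqno` in its `grastate.dat` file holds the most recent data and should be bootstrapped first. The node names and file contents are hypothetical, and a `seqno` of -1 (typical after an unclean shutdown) is treated as unknown.

```python
import re
from typing import Dict, Optional

# Matches a line such as "seqno:   1047" in the text of grastate.dat.
SEQNO_RE = re.compile(r"^seqno:\s*(-?\d+)\s*$", re.MULTILINE)


def parse_seqno(grastate_text: str) -> Optional[int]:
    """Extract the seqno value, or None if the file has no seqno line."""
    match = SEQNO_RE.search(grastate_text)
    return int(match.group(1)) if match else None


def pick_bootstrap_node(states: Dict[str, str]) -> Optional[str]:
    """Return the node with the highest known seqno, or None if no node
    reports a usable (non-negative) seqno."""
    best_node, best_seqno = None, -1
    for node, text in states.items():
        seqno = parse_seqno(text)
        if seqno is not None and seqno > best_seqno:
            best_node, best_seqno = node, seqno
    return best_node


# Hypothetical grastate.dat contents collected from each controller.
states = {
    "controller-0": "# GALERA saved state\nseqno:   1041\n",
    "controller-1": "# GALERA saved state\nseqno:   1047\n",
    "controller-2": "# GALERA saved state\nseqno:   -1\n",
}
print(pick_bootstrap_node(states))  # controller-1
```

Whichever node is selected this way, the restart command then has to be run on that same node, which is the step that was missed in this case.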

Comment 3 Rodrigo A B Freire 2017-01-16 12:26:46 UTC
Removing needinfo?, as comment 2 explained the issue clearly.
