Description of problem:
galera-master fails to start, or cannot be manually promoted to master, if the former master node goes offline.

Version-Release number of selected component (if applicable):
RHOSP7, resource-agents-3.9.5-82.el7_3.3

How reproducible:
Uncertain

Steps to Reproduce:
1. Have a 3-node HA environment running.
2. Kill (abruptly shut down) the node that is the Galera master.

Actual results:
Galera-master cannot be started or promoted to master on the other nodes.

Expected results:
The other Galera nodes should be able to start or be promoted to master.

Additional info:
In a Galera cluster, there is no single "galera master" node: all nodes are masters. So when one controller is powered off, the remaining nodes simply continue running normally, and it is not clear what was actually observed. At a minimum, we will need SOS reports from all three controller nodes to begin diagnosing what was seen.
The resource agent could not restart the stopped galera cluster automatically because one of the controller nodes was offline. The logs from the customer ticket show that a mistake was made while following the procedure to bootstrap the cluster manually: the pcs command to restart the first galera node was not executed on the node selected for bootstrap. Consequently, the resource agent prevented the restart. After the manual restart procedure was repeated correctly, the cluster came up as expected.
Removing needinfo?, as C#2 was quite explanatory.