A network issue caused the nodes to be partitioned from each other. They lost quorum, and pacemaker consequently shut the nodes down. Investigation showed that at the time of the incident the database was abnormally large, probably because some expired data, such as keystone tokens, was not being flushed periodically. This caused galera nodes to time out when they needed to synchronize state via SST (full rsync) before joining the cluster. After two consecutive restarts, the rsync would succeed and the node would reconnect. One node experienced DB corruption. Logs show that there were attempts to start galera via systemd concurrently with pacemaker, which may be the cause of the corruption. Eventually, that node would not restart automatically via pacemaker, because manual updates to the mariadb config had not been reverted and prevented a full restart of the node. Regularly cleaning up expired data from the DB keeps its size under control and will prevent further start timeouts when SST is needed.
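To illustrate why DB size matters here, below is a rough back-of-the-envelope sketch of whether a full-copy SST can finish before a start timeout expires. The function name and all figures (40 GiB and 10 GiB databases, 50 MiB/s rsync throughput, 300 s timeout) are hypothetical and are not taken from this incident; real transfers also depend on disk and network load.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def sst_fits_in_timeout(db_size_bytes, throughput_bytes_per_s, timeout_s):
    """Estimate a full-copy state transfer and compare it to a start timeout.

    Returns a tuple (estimated_seconds, fits). This is a naive model:
    transfer time = size / sustained throughput, with no overhead.
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    estimated_s = db_size_bytes / throughput_bytes_per_s
    return estimated_s, estimated_s <= timeout_s


# Hypothetical figures, for illustration only.
bloated = sst_fits_in_timeout(40 * GIB, 50 * MIB, 300)
cleaned = sst_fits_in_timeout(10 * GIB, 50 * MIB, 300)
print("bloated DB:", bloated)
print("cleaned DB:", cleaned)
```

Under these assumed numbers, the larger database would not finish transferring within the timeout while the smaller one would, which matches the behaviour described above: the bigger the DB, the more likely the joining node is stopped before SST completes.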
As an additional note to comment #4, the galera DB grew to an unexpected size because there were no cron jobs for automatic cleanup of the OpenStack DB. Newer versions of OSP ship with the required cron jobs, so an upgrade should be considered.
A longer-term solution for this bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1251525. *** This bug has been marked as a duplicate of bug 1251525 ***