Bug 1372616

Summary: galera can not recover after network issue
Product: Red Hat OpenStack
Reporter: Faiaz Ahmed <fahmed>
Component: mariadb-galera
Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED DUPLICATE
QA Contact: Shai Revivo <srevivo>
Severity: urgent
Priority: urgent
Version: 7.0 (Kilo)
CC: jmelvin, mbayer, mflusche, oblaut, srevivo
Target Milestone: async
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2016-10-12 14:14:42 UTC
Type: Bug

Comment 4 Damien Ciabrini 2016-09-26 08:36:39 UTC
A network issue caused the nodes to become partitioned from each other;
the cluster lost quorum and pacemaker consequently shut the nodes down.

Investigation showed that at the time of the incident the size of the
database was abnormally large, most likely because expired data such as
keystone tokens was not being flushed periodically.
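
For reference, the on-disk size per schema can be checked directly on one
of the galera nodes; the keystone.token table and its expires column below
assume the default Kilo schema:

  -- total on-disk size per schema
  SELECT table_schema, ROUND(SUM(data_length + index_length)/1024/1024) AS size_mb
    FROM information_schema.tables GROUP BY table_schema ORDER BY size_mb DESC;

  -- expired keystone tokens that were never flushed
  SELECT COUNT(*) FROM keystone.token WHERE expires < NOW();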

This caused galera nodes to time out when they had to synchronize state
via SST (a full rsync) before joining the cluster. After two consecutive
restarts the rsync would succeed and the node would reconnect.
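
For completeness, once a joining node accepts connections again its SST
progress can be followed through the standard Galera status variables:

  -- "Joining" while the rsync SST runs, "Joined"/"Synced" once it has completed
  SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
  -- number of nodes currently part of the cluster
  SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';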

One node also experienced a database corruption. The logs show attempts
to start galera via systemd concurrently with pacemaker, which may be the
cause of the corruption. That node eventually would not restart
automatically via pacemaker because manual updates to the mariadb config
had not been reverted and prevented a full restart of the node.
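
As a sanity check on pacemaker-managed controllers, mariadb should only
ever be started by the cluster; something along these lines confirms that
systemd is not competing with pacemaker (the unit name is the usual OSP 7
default):

  systemctl is-enabled mariadb   # expected: disabled
  systemctl status mariadb       # expected: inactive outside pacemaker's control
  pcs status                     # shows the galera resource as pacemaker sees it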

Regularly cleaning up expired data from the DB keeps its size under
control and will prevent further start timeouts when SST is needed.
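
As an immediate cleanup, the expired keystone tokens can be flushed
manually on one controller (keystone-manage ships with the keystone
package):

  keystone-manage token_flush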

Comment 5 Damien Ciabrini 2016-09-26 08:43:59 UTC
As an additional note to comment #4, the reason why the galera DB grew to
such an unexpected size is that there were no cron jobs doing automatic
cleanup of the OpenStack DB.

Newer versions of OSP ship with the required cron jobs (sketched below),
so an upgrade should be considered.
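
For illustration only, the shipped job is essentially a periodic
token_flush; the exact path, schedule and user below are assumptions and
may differ between releases:

  # /etc/cron.d/keystone-token-flush
  0 * * * * keystone keystone-manage token_flush >/dev/null 2>&1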

Comment 6 Damien Ciabrini 2016-10-12 14:14:42 UTC
A longer-term solution of this bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1251525.

*** This bug has been marked as a duplicate of bug 1251525 ***