Bug 1372616 - galera cannot recover after network issue
Summary: galera cannot recover after network issue
Keywords:
Status: CLOSED DUPLICATE of bug 1251525
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: mariadb-galera
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Damien Ciabrini
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-02 08:33 UTC by Faiaz Ahmed
Modified: 2019-12-16 06:39 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-12 14:14:42 UTC
Target Upstream Version:



Comment 4 Damien Ciabrini 2016-09-26 08:36:39 UTC
A network issue caused the nodes to become partitioned from each other,
losing quorum, and consequently pacemaker shut the nodes down.

Investigation showed that at the time of the incident, the size of the
database was abnormally large, probably because expired data such as
keystone tokens was not being purged periodically.

This caused the galera nodes to time out when they needed to synchronize
state with SST (full rsync) before joining the cluster. After two
consecutive restarts, the rsync would succeed and the node would reconnect.
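
Whether a rejoining node has actually finished SST and is back in sync can
be checked from the standard wsrep status counters; as an illustrative
check (not taken from the case logs):

    # On the rejoining node: "Synced" means the state transfer completed.
    mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
    # Cluster-wide view: should report the expected number of controller nodes.
    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"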

One node experienced a DB corruption; the logs show that there were
attempts to start galera via systemd concurrently with pacemaker, which
may be the cause of the corruption. Eventually, that node would not
restart automatically via pacemaker because manual updates to the
mariadb config had not been reverted and prevented a full restart of
the node.
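
As a precaution against that kind of concurrent start (a sketch only;
"galera" is the usual pacemaker resource name in OSP deployments and may
differ), one can verify that systemd is not competing with pacemaker for
mysqld:

    # systemd must never start mariadb on its own; pacemaker owns the service.
    systemctl disable mariadb
    systemctl is-enabled mariadb      # expect "disabled"
    # Confirm the DB is managed as a pacemaker resource instead.
    pcs status resources | grep -i galera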

Cleaning up expired data from the DB regularly ensures that its size
is kept under control and will prevent further start timeouts when SST
is needed.
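
For reference only (an illustrative query, not one taken from the case
data), the tables inflating the DB can be spotted by listing the largest
tables from information_schema:

    # Show the ten largest tables across all schemas, in MB.
    mysql -e "SELECT table_schema, table_name,
                     ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
              FROM information_schema.tables
              ORDER BY (data_length + index_length) DESC LIMIT 10;"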

Comment 5 Damien Ciabrini 2016-09-26 08:43:59 UTC
As an additional note to comment #4, the reason why the galera DB grew to
an unexpected size is that there were no cron jobs for automatic cleanup
of the OpenStack DB.
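
As a sketch of the kind of job that was missing (the hourly schedule and
log path below are illustrative, and this assumes UUID tokens stored in the
Keystone DB), a crontab entry for the keystone user could look like:

    # Purge expired Keystone tokens every hour.
    0 * * * * keystone-manage token_flush >> /var/log/keystone/keystone-tokenflush.log 2>&1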

Newer versions of OSP ship with the required cron jobs, so an upgrade
should be considered.

Comment 6 Damien Ciabrini 2016-10-12 14:14:42 UTC
A longer-term solution for this bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1251525.

*** This bug has been marked as a duplicate of bug 1251525 ***

