Bug 1376084
Summary: | Cluster becomes partitioned and galera1 node fails to join back. | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Jeremy <jmelvin> |
Component: | galera | Assignee: | Damien Ciabrini <dciabrin> |
Status: | CLOSED DUPLICATE | QA Contact: | Shai Revivo <srevivo> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 6.0 (Juno) | CC: | dciabrin, jmelvin, mbayer, srevivo |
Target Milestone: | --- | Keywords: | Unconfirmed, ZStream |
Target Release: | 6.0 (Juno) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Last Closed: | 2016-10-12 14:16:20 UTC | Type: | Bug |
Description
Jeremy
2016-09-14 15:52:28 UTC
When galera nodes need to join via SST (State Snapshot Transfer), they transfer the entire database with rsync. This transfer is taking longer than the default 300s promotion timeout configured for the galera resource in pacemaker. Per the resource configuration, hitting that timeout makes pacemaker stop managing galera on the node, i.e. it still monitors the resource but no longer restarts it.

After the pacemaker timeout, the SST eventually finishes and the galera node starts and runs, but pacemaker no longer manages it at that point. If a network partition occurs while the galera resource is unmanaged, pacemaker will not restart the cluster automatically.

The SSTs may be taking a long time because some tables in the DB are larger than expected, for example if expired keystone tokens are not flushed periodically. I would advise double-checking that and flushing if needed to reduce the DB size. If the DB size is genuinely expected, then the promote timeout of the galera resource in pacemaker needs to be raised.

A longer-term solution for this bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1251525.

*** This bug has been marked as a duplicate of bug 1251525 ***
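
For anyone sizing the promote timeout, here is a minimal sketch of the arithmetic: a full rsync SST takes roughly the database size divided by the sustained transfer rate, and that estimate (with some margin) has to fit inside the promote timeout. The helper names, the safety factor of 2, and the example size and throughput are all hypothetical. The resource ID `galera` and the exact `pcs resource update ... op promote timeout=...` form are assumptions to verify against the actual cluster configuration before running anything.

```python
import math

# Default promotion timeout mentioned in this report, in seconds.
DEFAULT_PROMOTE_TIMEOUT_S = 300


def estimate_sst_seconds(db_size_bytes, throughput_bytes_per_s):
    """Rough lower bound for a full rsync SST: size / sustained throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return db_size_bytes / throughput_bytes_per_s


def suggest_promote_timeout(db_size_bytes, throughput_bytes_per_s, safety_factor=2.0):
    """Pad the SST estimate by a safety factor (an arbitrary choice here),
    and never suggest less than the default timeout."""
    estimate = estimate_sst_seconds(db_size_bytes, throughput_bytes_per_s)
    padded = math.ceil(estimate * safety_factor)
    return max(DEFAULT_PROMOTE_TIMEOUT_S, padded)


def pcs_update_command(timeout_s, resource="galera"):
    """Build (but do not run) a pcs command line raising the promote timeout.
    The resource ID "galera" is an assumption; check `pcs status` for the real one."""
    return ["pcs", "resource", "update", resource,
            "op", "promote", "timeout={}s".format(timeout_s)]


if __name__ == "__main__":
    # Hypothetical figures: a 40 GiB database synced at 50 MiB/s.
    size = 40 * 1024 ** 3
    rate = 50 * 1024 ** 2
    timeout = suggest_promote_timeout(size, rate)
    print("estimated SST time: {:.0f}s".format(estimate_sst_seconds(size, rate)))
    print(" ".join(pcs_update_command(timeout)))
```

With these made-up numbers the SST alone takes about 819s, well past the 300s default, which matches the failure mode described above. If the size comes from expired keystone tokens, flushing them first (for example with `keystone-manage token_flush`) may bring the transfer back under the default without touching the timeout.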