Bug 1376084 - Cluster becomes partitioned and galera1 node fails to join back.
Summary: Cluster becomes partitioned and galera1 node fails to join back.
Keywords:
Status: CLOSED DUPLICATE of bug 1251525
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: galera
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 6.0 (Juno)
Assignee: Damien Ciabrini
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-14 15:52 UTC by Jeremy
Modified: 2019-12-16 06:47 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-12 14:16:20 UTC
Target Upstream Version:
Embargoed:



Description Jeremy 2016-09-14 15:52:28 UTC
Description of problem: Need to determine why the Galera cluster fails to sync. Looking at the logs, I see that galera1 becomes partitioned while the other two nodes are in another partition.
I see an error in the logs for galera1:

160911  5:17:49 [ERROR] WSREP: async IST sender failed to serve tcp://10.137.130.42:4568: ist send failed: 1', asio error 'Connection reset by peer': 104 (Connection reset by peer)
         at galera/src/ist.cpp:send():769
160911  5:17:49 [ERROR] WSREP: gcs/src/gcs.c:_join():800: Sending JOIN failed: -103 (Software caused connection abort).
terminate called after throwing an instance of 'gu::Exception'
  what():  gcs_join(-104) failed: 103 (Software caused connection abort)
         at galera/src/gcs.hpp:join():177
160911  5:17:49 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

Version-Release number of selected component (if applicable):
Server version: 5.5.42-MariaDB-wsrep


How reproducible:
unknown

Steps to Reproduce:
1. pcs cluster standby d4-ucos-galera2
   pcs cluster unstandby d4-ucos-galera2
2. Observe the resulting pcs status:
   Master/Slave Set: galera-master [galera]
       galera (ocf::heartbeat:galera): FAILED Master d4-ucos-galera3 (unmanaged)
       Masters: [ d4-ucos-galera1 d4-ucos-galera2 ]

Actual results:
The galera cluster stays partitioned, and a pcs cluster stop/start was needed to reconnect the node.
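
For reference, a minimal sketch of the recovery that was needed (assuming the whole cluster can be bounced at once; pcs can also be pointed at a single node):

    pcs cluster stop --all     # stop pacemaker/corosync on every cluster node
    pcs cluster start --all    # start them again so galera gets re-bootstrapped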

Expected results:
The galera nodes sync back into a single cluster.

Additional info:

10.137.130.40  galera1  UUID 93828457-77de-11e6-b669-2aea46675e16  address tcp://10.137.130.40:4567
10.137.130.41  galera2  UUID b635c324-77e7-11e6-8af1-83dd0e8988e1  address tcp://10.137.130.41:4567
10.137.130.42  galera3  UUID b2831ef2-77f4-11e6-89d8-2af9e6d5b4e0  address tcp://10.137.130.42:4567
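
To confirm which partition each node ended up in, one option (a sketch; run on each galera node, and credentials may differ on this deployment) is to query the wsrep status variables:

    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'"      # Primary vs. non-Primary component
    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'"        # number of nodes in this component
    mysql -e "SHOW STATUS LIKE 'wsrep_incoming_addresses'"  # members of this component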

Comment 7 Damien Ciabrini 2016-09-15 09:52:25 UTC
When galera nodes need to join with SST, they transfer the entire
database via rsync. This transfer is taking longer than the default
300s promotion timeout configured for the galera resource in
pacemaker. Per the resource configuration, this makes pacemaker stop
managing galera on the node, i.e. it still monitors it but no longer
restarts it.
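
To check the promotion timeout currently configured on the resource, something along these lines should work (a sketch; on newer pcs releases the equivalent subcommand is "pcs resource config"):

    pcs resource show galera    # look for the "promote" operation and its timeout value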

After the pacemaker timeout expires, the SST eventually finishes and
the galera node ends up started and running; however, pacemaker no
longer manages it at that point.

If a network partition occurs while the galera resource is unmanaged,
pacemaker will not restart the cluster automatically.
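
Once the node is healthy again, a sketch of how to hand the resource back to pacemaker (resource name "galera" as in the pcs status above):

    pcs resource cleanup galera    # clear the recorded failures
    pcs resource manage galera     # let pacemaker manage the resource again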

It is possible that SSTs are taking a long time because some tables in
the DB are larger than expected, for example if expired keystone
tokens are not flushed periodically. I would advise double-checking
that and flushing if needed to reduce the DB size.
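
A sketch of that check and cleanup, assuming the keystone database is named "keystone" and tokens are stored in SQL (the Juno default):

    # list the largest tables in the keystone schema
    mysql -e "SELECT table_name, ROUND((data_length+index_length)/1024/1024) AS size_mb FROM information_schema.tables WHERE table_schema='keystone' ORDER BY size_mb DESC;"

    # flush expired tokens; consider running this periodically from cron
    keystone-manage token_flush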

If the DB size really is expected, then one would need to raise the
promote timeout of the galera resource in pacemaker.
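
A sketch of how that could be done (600s is only an example value; pick a timeout that comfortably covers an rsync SST of the real DB size, and keep any interval/on-fail options already set on the promote operation):

    pcs resource update galera op promote timeout=600s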

Comment 8 Damien Ciabrini 2016-10-12 14:16:20 UTC
A longer-term solution of this bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1251525.

*** This bug has been marked as a duplicate of bug 1251525 ***

