Bug 1376084

Summary: Cluster becomes partitioned and galera1 node fails to join back.
Product: Red Hat OpenStack
Reporter: Jeremy <jmelvin>
Component: galera
Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED DUPLICATE
QA Contact: Shai Revivo <srevivo>
Severity: medium
Priority: medium
Version: 6.0 (Juno)
CC: dciabrin, jmelvin, mbayer, srevivo
Target Milestone: ---
Keywords: Unconfirmed, ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2016-10-12 14:16:20 UTC
Type: Bug

Description Jeremy 2016-09-14 15:52:28 UTC
Description of problem: Need to determine why the Galera cluster fails to sync. Looking at the logs, galera1 becomes partitioned while the other two nodes form a second partition.
The log for galera1 shows the following error:

160911  5:17:49 [ERROR] WSREP: async IST sender failed to serve tcp://10.137.130.42:4568: ist send failed: 1', asio error 'Connection reset by peer': 104 (Connection reset by peer)
         at galera/src/ist.cpp:send():769
160911  5:17:49 [ERROR] WSREP: gcs/src/gcs.c:_join():800: Sending JOIN failed: -103 (Software caused connection abort).
terminate called after throwing an instance of 'gu::Exception'
  what():  gcs_join(-104) failed: 103 (Software caused connection abort)
         at galera/src/gcs.hpp:join():177
160911  5:17:49 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

Version-Release number of selected component (if applicable):
Server version: 5.5.42-MariaDB-wsrep


How reproducible:
unknown

Steps to Reproduce:
1. pcs cluster standby d4-ucos-galera2
   pcs cluster unstandby d4-ucos-galera2
2. pcs status then shows:
   Master/Slave Set: galera-master [galera]
       galera  (ocf::heartbeat:galera):  FAILED MASTER d4-ucos-galera3 (unmanaged)
       Masters: [ d4-ucos-galera1 d4-ucos-galera2 ]

Actual results:
The galera cluster stays partitioned; a pcs cluster stop/start was needed to reconnect the nodes.
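
For reference, the manual recovery was roughly along these lines (a sketch; exact node and resource names may differ in a given deployment):

    # stop and restart the whole pacemaker cluster so the galera nodes rejoin a single partition
    pcs cluster stop --all
    pcs cluster start --all
    # verify the galera master/slave set afterwards
    pcs status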

Expected results:
The partitioned galera node rejoins and resyncs with the rest of the cluster automatically.

Additional info:

10.137.130.40  galera1  93828457-77de-11e6-b669-2aea46675e16  tcp://10.137.130.40:4567
10.137.130.41  galera2  b635c324-77e7-11e6-8af1-83dd0e8988e1  tcp://10.137.130.41:4567
10.137.130.42  galera3  b2831ef2-77f4-11e6-89d8-2af9e6d5b4e0  tcp://10.137.130.42:4567
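
To confirm which partition each node ended up in, the wsrep status variables can be checked on every node, e.g. (assuming local root access to mysql):

    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"        # Primary vs. non-Primary
    mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"   # Synced, Donor, Joining, ...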

Comment 7 Damien Ciabrini 2016-09-15 09:52:25 UTC
When a galera node needs to join with SST, it transfers the entire
database via rsync.  Here this transfer is taking longer than the
default 300s promote timeout configured for the galera resource in
pacemaker.  Per the resource configuration, pacemaker then stops
managing galera on that node, i.e. it still monitors it but no longer
restarts it.

After the pacemaker timeout, the SST eventually finishes and the
galera node ends up started and running, but pacemaker no longer
manages it at that point.
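
For what it's worth, once the SST has completed, the node can usually be put back under pacemaker control without a full cluster restart, roughly like this (resource name assumed to be galera, as in the pcs output above):

    # let pacemaker manage the resource again on that node
    pcs resource manage galera
    # clear the recorded failed promote action so pacemaker re-evaluates it
    pcs resource cleanup galera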

If a network partition occurs while the galera resource is unmanaged,
pacemaker will not restart the cluster automatically.

It is possible that SSTs are taking a long time because some tables in
the DB are larger than expected, for example if expired keystone
tokens are not flushed periodically. I would advise double-checking
that and flushing if needed to reduce the DB size.
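
On Juno, flushing expired tokens is typically done with keystone-manage, for example from a periodic cron job on the controller (a sketch; the scheduling is up to the operator):

    # remove expired tokens from the keystone database
    keystone-manage token_flush
    # e.g. as an hourly cron entry run as the keystone user:
    # 0 * * * * keystone keystone-manage token_flush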

If the DB size is really expected, then one would need to raise the
promote timeout of the galera resource in pacemaker.
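
A sketch of raising the promote timeout (the 1200s value is only an example and should be sized to the actual SST duration observed in this environment):

    # raise the promote operation timeout on the galera resource
    pcs resource update galera op promote timeout=1200s
    # check the resulting operation configuration
    pcs resource show galera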

Comment 8 Damien Ciabrini 2016-10-12 14:16:20 UTC
A longer-term solution to this bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1251525.

*** This bug has been marked as a duplicate of bug 1251525 ***