Description of problem:
Need to determine why the Galera cluster fails to sync. Looking at the logs, galera1 becomes partitioned while the other two nodes form a separate partition. The logs on galera1 show the following errors:

160911  5:17:49 [ERROR] WSREP: async IST sender failed to serve tcp://10.137.130.42:4568: ist send failed: 1', asio error 'Connection reset by peer': 104 (Connection reset by peer)
	 at galera/src/ist.cpp:send():769
160911  5:17:49 [ERROR] WSREP: gcs/src/gcs.c:_join():800: Sending JOIN failed: -103 (Software caused connection abort).
terminate called after throwing an instance of 'gu::Exception'
  what():  gcs_join(-104) failed: 103 (Software caused connection abort)
	 at galera/src/gcs.hpp:join():177
160911  5:17:49 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

Version-Release number of selected component (if applicable):
Server version: 5.5.42-MariaDB-wsrep

How reproducible:
Unknown

Steps to Reproduce:
1. pcs cluster standby d4-ucos-galera2
   pcs cluster unstandby d4-ucos-galera2
2. Master/Slave Set: galera-master [galera]
     galera (ocf::heartbeat:galera): FAILED MASTER d4-ucos-galera3 (unmanaged)
     Masters: [ d4-ucos-galera1 d4-ucos-galera2 ]
3.

Actual results:
The Galera cluster is partitioned and required a pcs cluster stop/start to reconnect.

Expected results:
Galera nodes sync back together.

Additional info:
10.137.130.40  galera 1  93828457-77de-11e6-b669-2aea46675e16 (tcp://10.137.130.40:4567)
10.137.130.41  galera 2  b635c324-77e7-11e6-8af1-83dd0e8988e1, 'tcp://0.0.0.0:4567') address 'tcp://10.137.130.41:4567
10.137.130.42  galera 3  b2831ef2-77f4-11e6-89d8-2af9e6d5b4e0, 'tcp://0.0.0.0:4567') address 'tcp://10.137.130.42:4567
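For reference, the partition state can be confirmed independently of Pacemaker by querying the wsrep status variables on each node. This is a diagnostic sketch assuming local mysql client access; the expected values shown are what a healthy 3-node cluster would report:

  # On each node: a partitioned node reports non-Primary here
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"
  # A fully joined 3-node cluster reports 3 here
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"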
When Galera nodes need to join via SST, they transfer the entire database with rsync. This transfer is taking longer than the default 300s promote timeout configured for the galera resource in Pacemaker. Per the resource configuration, this makes Pacemaker stop managing galera on that node, i.e. it still monitors the resource but no longer restarts it. After the Pacemaker timeout, the SST eventually finishes and the galera node is started and running, but at that point Pacemaker no longer manages it. If a network partition then occurs while the galera resource is unmanaged, Pacemaker will not restart the cluster automatically.

It is possible that SSTs are taking a long time because some tables in the DB are larger than expected, for example if expired Keystone tokens are not flushed periodically. I would advise double-checking that and flushing if needed to reduce the DB size. If the DB size is genuinely expected, then the promote timeout of the galera resource in Pacemaker needs to be raised. Both remediations are sketched below.
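A minimal sketch of both remediations, assuming a pre-Fernet Keystone deployment (persistent token backend) and a galera resource named "galera" as shown in the pcs output above; the 600s value is illustrative, not taken from this report:

  # Flush expired Keystone tokens to shrink the token table
  # (run on a controller; typically also scheduled via cron):
  keystone-manage token_flush

  # Raise the promote operation timeout on the galera resource so the SST
  # can complete before Pacemaker gives up on the promotion:
  pcs resource update galera op promote timeout=600s

After changing the timeout, a subsequent standby/unstandby cycle of a node should let the SST finish while the resource is still managed.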
A longer-term solution to this bug is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1251525.

*** This bug has been marked as a duplicate of bug 1251525 ***