Bug 1195226 - Rebooting a single galera node causes the other 2 nodes to go into a failed state (non-operational)
Summary: Rebooting a single galera node causes the other 2 nodes to go into a failed state (non-operational)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: galera
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: z3
Target Release: 6.0 (Juno)
Assignee: Michael Bayer
QA Contact: Asaf Hirshberg
URL:
Whiteboard:
Depends On:
Blocks: 1174326 1198800
 
Reported: 2015-02-23 12:25 UTC by Asaf Hirshberg
Modified: 2019-07-11 08:41 UTC (History)
14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-04-08 22:59:08 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
/var/log/messages (13.01 MB, text/plain)
2015-02-23 12:25 UTC, Asaf Hirshberg
no flags Details
mariadb log (22.26 KB, text/plain)
2015-02-23 12:26 UTC, Asaf Hirshberg
no flags Details
Errors from Asaf setup (30.44 KB, text/plain)
2015-02-23 12:40 UTC, Ofer Blaut
no flags Details
Errors from Leonid setup (80.78 KB, text/plain)
2015-02-23 12:41 UTC, Ofer Blaut
no flags Details
server log of a1 reproduce (3.52 MB, application/x-gzip)
2015-03-18 14:18 UTC, Asaf Hirshberg
no flags Details
mysqld.log from 192.168.0.2 (84.67 KB, text/plain)
2015-03-18 17:52 UTC, Michael Bayer
no flags Details
mysqld.log from 192.168.0.3 (77.73 KB, text/plain)
2015-03-18 17:52 UTC, Michael Bayer
no flags Details
mysqld.log from 192.168.0.4 (85.56 KB, text/plain)
2015-03-18 17:53 UTC, Michael Bayer
no flags Details
Python script to create database load (3.90 KB, text/plain)
2015-03-24 23:02 UTC, Michael Bayer
no flags Details

Description Asaf Hirshberg 2015-02-23 12:25:38 UTC
Created attachment 994359 [details]
/var/log/messages

Description of problem:
We tested the new puddle on an HA-neutron environment (bare-metal machines) with Rally. We created a task that creates and deletes 20 instances at a time. When I rebooted a server in the middle of a task, the server failed to start the Galera cluster during its initialization process.

150222 18:32:59 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.

Version-Release number of selected component (if applicable):
rhel-osp-installer-client-0.5.5-5.el7ost.noarch
rhel-osp-installer-0.5.5-5.el7ost.noarch
mariadb-galera-server-5.5.40-6.el7ost.x86_64
mariadb-galera-common-5.5.40-6.el7ost.x86_64
galera-25.3.5-6.el7ost.x86_64
mariadb-5.5.41-2.el7_0.x86_64
mariadb-libs-5.5.41-2.el7_0.x86_64

How reproducible:
3/3

Steps to Reproduce:
1. Create an HA-neutron setup with 3 controllers and 1 compute.
2. Start a Rally task that boots & deletes 20 VMs.
3. Reboot a server.

Comment 5 Asaf Hirshberg 2015-02-23 12:26:14 UTC
Created attachment 994360 [details]
mariadb log

Comment 6 Ofer Blaut 2015-02-23 12:40:45 UTC
Created attachment 994364 [details]
Errors from Asaf setup

Comment 7 Ofer Blaut 2015-02-23 12:41:29 UTC
Created attachment 994365 [details]
Errors from Leonid setup

Comment 8 Ryan O'Hara 2015-02-23 14:45:07 UTC
A quick search suggests that this is the solution:

mysqld --tc-heuristic-recover=ROLLBACK

Comment 9 Ofer Blaut 2015-02-23 15:00:49 UTC
(In reply to Ryan O'Hara from comment #8)
> A quick search suggests that this is the solution:
> 
> mysqld --tc-heuristic-recover=ROLLBACK

mysqld doesn't exist; do you mean mysqld_safe?

Is it related to https://mariadb.atlassian.net/browse/MDEV-6660 ?

Comment 10 Ryan O'Hara 2015-02-23 16:20:22 UTC
(In reply to Ofer Blaut from comment #9)
> (In reply to Ryan O'Hara from comment #8)
> > A quick search suggests that this is the solution:
> > 
> > mysqld --tc-heuristic-recover=ROLLBACK
> 
> mysqld doesn't exist; do you mean mysqld_safe?

mysqld does exist, it just is not in your default path. Anyway, just use mysqld_safe.

> Is it related to https://mariadb.atlassian.net/browse/MDEV-6660 ?

Maybe.

Comment 11 Ryan O'Hara 2015-02-23 19:14:29 UTC
As far as I can tell mariadb should not and will not start until you clear up the pending transaction.

Comment 12 Ofer Blaut 2015-02-24 05:53:02 UTC
Hi Ryan

This is not HA; a server might fail in the middle of a transaction.

The cluster must keep working.

Comment 13 Ryan O'Hara 2015-02-24 14:14:10 UTC
(In reply to Ofer Blaut from comment #12)
> Hi Ryan
> 
> This is not HA; a server might fail in the middle of a transaction.
> 
> The cluster must keep working.

You should have 3 controllers, thus 3 mariadb nodes. According to the bug description, you suddenly rebooted one of the controllers. It turns out that controller was in the middle of a transaction, so when you brought the node back online it failed to join the galera cluster. Is this correct so far?

In this case you can't join the cluster because mariadb will not start until you tell it what to do about the pending transaction. This requires manual intervention. In the meantime, your other two galera nodes should be working.

The reason this BZ should be closed is that mariadb-galera, as far as I can tell, is working as expected.
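
(For reference, the manual intervention would look roughly like this - a hedged sketch only; ROLLBACK vs. COMMIT has to be decided per case, and the node should only be touched once the other two nodes are confirmed healthy:)

# on the failed node only
mysqld_safe --tc-heuristic-recover=ROLLBACK   # or COMMIT, depending on what should happen to the prepared transaction
# once mysqld starts cleanly, stop it again and let the cluster manager bring the node back into the galera cluster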

Comment 14 Arthur Berezin 2015-03-11 16:30:12 UTC
Adding needinfo on Ofer to review comment #13 and close BZ if agreed.

Comment 15 Ofer Blaut 2015-03-12 14:47:23 UTC
The other two controllers were in Slave State, and the Galera state was "not synced".

Comment 16 Perry Myers 2015-03-16 11:15:47 UTC
@mbayer: Can you please take this bug over from rohara?

As discussed on list there are multiple items at play here, but the only thing we should be concerned about in this specific bug is whether or not a reboot of a single node in the Galera cluster would cause the other two nodes to go offline.

We cannot fix (in this bug) the issue where a transaction is in flight and requires manual unblocking. Apparently, that is caused by a problem with systemd, or an improper unit file that does not wait for the init scripts to complete before halting the system. A separate bug should be filed by QE for that specific issue.

But for this issue, what we need to determine is whether we can replicate a situation where a failure in one Galera node causes the other two to become non-operational, thereby taking the entire database offline.

Comment 17 Ryan O'Hara 2015-03-16 13:09:34 UTC
(In reply to Ofer Blaut from comment #15)
> The other two controllers were in Slave State 

Where are you seeing "slave state"?

Comment 18 Ofer Blaut 2015-03-16 16:39:34 UTC
It was seen in pcs status

Comment 19 Ryan O'Hara 2015-03-16 20:55:47 UTC
(In reply to Ofer Blaut from comment #18)
> It was seen in pcs status

OK. I don't think that applies here, but I defer to the pacemaker folks. Galera itself is multi-master.

Comment 20 Michael Bayer 2015-03-16 22:36:48 UTC
OK...aren't we typically setting things up so that all the apps are talking to *one* actual master via HAProxy?   That would be the state in which the two other nodes are working fine but the one chosen as the "master" is seen as "out of sync".  But clustercheck on that node would be reporting "not synced" for that node, and I'd assume HAProxy would move to another node, hence cluster stays up.

What is the origin of the message "not synced", is that clustercheck, and is that the state reported by all nodes or just the one that was rebooted?
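
(For context on how that wiring usually works in an OSP HA layout: haproxy routes all database traffic to one node at a time and uses clustercheck as its health check. The snippet below is illustrative only; the check port, backend name and stick settings are assumptions, not taken from this setup.)

listen galera
    bind 10.35.173.156:3306
    option httpchk                  # clustercheck answers 200/503 over HTTP (usually via xinetd on port 9200)
    stick-table type ip size 1000
    stick on dst                    # keep all clients on a single backend at a time
    server controller1 192.168.0.2:3306 check port 9200 inter 1s
    server controller2 192.168.0.3:3306 check port 9200 inter 1s
    server controller3 192.168.0.4:3306 check port 9200 inter 1s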

Comment 21 Michael Bayer 2015-03-17 20:14:32 UTC
Asaf / Ofer - we need to confirm that in the "failed state" described in this issue, the Galera cluster itself has failed, e.g. that connecting directly to the two other MariaDB instances either fails or reaches a database that is unusable, and if so, what the state of those two databases exactly is. There's a mariadb log here, but I assume that is from the node with the stuck transaction on it. If the other two MariaDBs are usable, but the "cluster" is unavailable from a pacemaker point of view, then the issue is elsewhere, e.g. pacemaker or haproxy.

Can this "failed state" be reproduced in an environment I can login towards?  That way I can just see what exactly is failed.

Comment 22 Michael Bayer 2015-03-17 20:29:31 UTC
Other questions:

1. The /var/log/messages log shows mariadb being started by systemd. Under pacemaker this should not occur; mariadb should be started by the OCF resource agent.

2. I can't find the context for the "Found 1 prepared transactions!" error in the attached logfiles. The errors within the logfiles seem to indicate conflicting file access (two different systems trying to start mariadb at the same time?) as a key issue.

Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 galera(galera)[29571]: INFO: MySQL not running: removing old PID file
Feb 23 12:31:55 mac441ea1733d43 galera(galera)[29571]: INFO: MySQL stopped
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11

...

Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 systemd: mariadb.service: main process exited, code=exited, status=1/FAILURE
Feb 23 12:31:55 mac441ea1733d43 systemd: Unit mariadb.service entered failed state.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
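
(A quick way to check item 1 by hand - whether systemd, rather than the resource agent, is configured to manage mariadb on a node; standard commands, nothing specific to this setup:)

systemctl is-enabled mariadb.service   # should be 'disabled' when pacemaker owns the service
systemctl status mariadb.service       # should be inactive; the galera OCF agent launches mysqld_safe itself
pcs status | grep -A 3 galera          # the resource-agent view of the galera master/slave set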

Comment 23 Michael Bayer 2015-03-17 20:46:50 UTC
re: --tc-heuristic-recover

1. MySQL's official documentation says the option is unused: http://dev.mysql.com/doc/refman/5.6/en/server-options.html#option_mysqld_tc-heuristic-recover; however, considering that Percona devs have been discussing this flag, that would appear to be inaccurate?

2. The message "prepared transaction" implies that this is a two-phase transaction that needs to be rolled back. This Percona article supports that notion: http://www.percona.com/blog/2014/07/15/tokudb-tips-mysql-backups/. Question: what part of OpenStack is using two-phase transactions? That would seem entirely inappropriate; we need to look into Rally etc. and see what's going on.

3. The same Percona article states that the flag is entirely broken due to this issue (Oct 14): http://bugs.mysql.com/bug.php?id=70860; Percona's issue: https://bugs.launchpad.net/percona-server/+bug/1334330
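
(On a node that is still up, a hedged way to check whether anything is actually leaving XA transactions around - XA RECOVER is standard MariaDB SQL; it obviously cannot be run on the node that refuses to start:)

mysql -e "XA RECOVER;"                                             # lists any transactions sitting in the XA PREPARED state
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -i -A 2 'prepared'   # any prepared transactions visible to InnoDB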

Comment 24 Michael Bayer 2015-03-17 20:58:55 UTC
codership / Galera:

https://mariadb.com/kb/en/mariadb/mariadb-galera-cluster-known-limitations/ "XA transactions are not supported.".

https://mariadb.com/kb/en/mariadb/xa-transactions/

"In MariaDB, XA transactions can only be used with storage engines that support them. At least InnoDB, TokuDB and SPIDER support them. For InnoDB, XA transactions can be disabled by setting the innodb_support_xa server system variable to 0. XA is currently not available with MariaDB Galera Cluster."

TokuDB above is what's mentioned in the Percona article where they're dealing with this flag.
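
(A hedged check, not a recommendation to change anything: the current value of that variable on the running nodes can be read with)

mysql -e "SHOW VARIABLES LIKE 'innodb_support_xa';"   # ON/1 means InnoDB's internal two-phase commit support is active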

Comment 25 Andrew Beekhof 2015-03-18 11:09:57 UTC
(In reply to Michael Bayer from comment #20)
> OK...aren't we typically setting things up so that all the apps are talking
> to *one* actual master via HAProxy?

Correct

>   That would be the state in which the
> two other nodes are working fine but the one chosen as the "master" is seen
> as "out of sync".  But clustercheck on that node would be reporting "not
> synced" for that node, and I'd assume HAProxy would move to another node,
> hence cluster stays up.

HAProxy would take that node out of rotation yes.
However it seems that this node going down is causing pacemaker's monitoring of galera to detect some kind of failures on the remaining nodes - hence why they appear to be in a slave state.

I see:

Feb 23 12:06:56 mac441ea1733d43 lrmd[2935]: notice: operation_finished: galera_monitor_20000:21982:stderr [ ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) ]
Feb 23 12:06:56 mac441ea1733d43 lrmd[2935]: notice: operation_finished: galera_monitor_20000:21982:stderr [ ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) ]
Feb 23 12:06:56 mac441ea1733d43 lrmd[2935]: notice: operation_finished: galera_monitor_20000:21982:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'monitor_user' has permissions to view status ]
Feb 23 12:06:56 mac441ea1733d43 lrmd[2935]: notice: operation_finished: galera_monitor_20000:21982:stderr [ ocf-exit-reason:local node <pcmk-mac441ea1733d43> is neither in primary mode nor in read_only mode. Unknown state. ]
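
(Those monitor failures amount to roughly the following checks - shown here only as a sketch of what the resource agent is doing; the agent's exact invocation may differ:)

mysql --user=monitor_user -e "SHOW STATUS LIKE 'wsrep_cluster_status';"   # agent expects 'Primary' on a healthy node
mysql --user=monitor_user -e "SHOW VARIABLES LIKE 'read_only';"           # or read_only=ON while the node is in the slave state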


> What is the origin of the message "not synced", is that clustercheck, and is
> that the state reported by all nodes or just the one that was rebooted?

Comment 26 Asaf Hirshberg 2015-03-18 14:14:47 UTC
I reproduced it on the A1 environment (bare-metal HA-neutron).
Running a Rally boot & delete of 20 instances as a constant load, and during the task I rebooted one of the controllers with the "reboot" command. The server started, but MySQL was down and the output of "clustercheck" was "cluster not synced". After some time I checked the other controllers and saw that MySQL was down on the other controllers as well.
I am keeping the setup if someone wants to debug it. I am also adding /var/log/messages and the MySQL log from the rebooted controller and another controller in the cluster.


rpm:
foreman-installer-1.6.0-0.3.RC1.el7ost.noarch
openstack-foreman-installer-3.0.16-1.el7ost.noarch
rhel-osp-installer-0.5.5-5.el7ost.noarch
rhel-osp-installer-client-0.5.5-5.el7ost.noarch
mariadb-galera-server-5.5.40-6.el7ost.x86_64
mariadb-galera-common-5.5.40-6.el7ost.x86_64
galera-25.3.5-6.el7ost.x86_64
MySQL-python-1.2.3-11.el7.x86_64
mariadb-5.5.41-2.el7_0.x86_64
mariadb-libs-5.5.41-2.el7_0.x86_64

Comment 27 Asaf Hirshberg 2015-03-18 14:18:57 UTC
Created attachment 1003252 [details]
server log of a1 reproduce

Comment 28 Asaf Hirshberg 2015-03-18 14:30:10 UTC
ip-galera-pub-10.35.173.156	(ocf::heartbeat:IPaddr2):	Started pcmk-mac848f69fbc643 
 Master/Slave Set: galera-master [galera]
     Slaves: [ pcmk-mac848f69fbc643 ]
     Stopped: [ pcmk-mac848f69fbc49f pcmk-mac848f69fbc4c3 ]

Comment 29 Ryan O'Hara 2015-03-18 16:25:28 UTC
(In reply to Asaf Hirshberg from comment #26)
> I reproduced it on the A1 environment (bare-metal HA-neutron).
> Running a Rally boot & delete of 20 instances as a constant load, and during
> the task I rebooted one of the controllers with the "reboot" command. The
> server started, but MySQL was down and the output of "clustercheck" was
> "cluster not synced". After some time I checked the other controllers and saw
> that MySQL was down on the other controllers as well.

Don't rely on clustercheck. It is not a debugging tool. Its only purpose is to give an HTTP response to haproxy for healthcheck purposes. Specifically, it just tells haproxy if the node is synced and therefore available for reads/writes.

This article I wrote is outdated, but still has some valid bits:

https://www.rdoproject.org/MariaDB_Galera

If you run mysql -e "SHOW STATUS LIKE 'wsrep%';" you will get a full picture of what the state is. All clustercheck does is look to see whether wsrep_local_state is 4.
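
A few of the individual wsrep status variables worth checking per node (standard Galera status names, nothing specific to this deployment):

mysql -e "SHOW STATUS LIKE 'wsrep_local_state';"           # 4 = Synced; this is the only thing clustercheck tests
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"   # human-readable form: Synced / Donor / Joined / ...
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"        # Primary vs. non-Primary component
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"          # how many nodes this node believes are in the cluster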

Comment 30 Michael Bayer 2015-03-18 16:46:25 UTC
(In reply to Ryan O'Hara from comment #29)
> 
> Don't rely on clustercheck. It is not a debugging tool. Its only purpose is
> to give an HTTP response to haproxy for healthcheck purposes.

Right, but clustercheck would be one reason why HAProxy wouldn't go to those nodes even if they were running fine (though it sounds like they are not). I'm going to ssh into the machine now and see what's up; specifically, I want to see whether systemd and the resource agents are competing to start the service.

Comment 31 Michael Bayer 2015-03-18 17:52:22 UTC
Created attachment 1003350 [details]
mysqld.log from 192.168.0.2

Comment 32 Michael Bayer 2015-03-18 17:52:45 UTC
Created attachment 1003351 [details]
mysqld.log from 192.168.0.3

Comment 33 Michael Bayer 2015-03-18 17:53:07 UTC
Created attachment 1003352 [details]
mysqld.log from 192.168.0.4

Comment 34 Michael Bayer 2015-03-18 17:58:44 UTC
So I've looked in the logs, so far:

1. With the latest reproduction case, systemd is not involved at all. The services are not managed by systemd and there is nothing about systemd in the logs related to mariadb/mysql. All the galera nodes seem to be started by the resource manager.

2. It looks like reproducing this requires a lot of stops/starts, as I see many restarts on all three nodes.

3. The 192.168.0.2 node does in fact report an XA transaction that needs to be recovered. This should not be the case, as XA transactions aren't supported with Galera. I have three very general theories: 1. something is actually issuing an XA transaction on the cluster, and if that happens while the restart does, things break; 2. Galera uses the XA system internally in some way, and that's what we see; 3. this is not really an XA transaction issue, and the InnoDB state is corrupted in some strange way such that it *thinks* this is what it sees.

4. The 0.3/0.4 nodes do restart, but they fail to create a cluster once the .2 node is down, and the error reported is "SSL error: SSL_CTX_set_default_verify_paths failed".   This error appears many times in all three logs on some of the startups, but once the .2 node is permanently down, it seems to persist.   


Segments of logs are below:


.2 had its outage after 14:54, and was restarting at 15:04:

150318 14:54:09 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2636680)
150318 14:54:09 [Note] WSREP: Synchronized with group, ready for connections
150318 14:54:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150318 14:54:12 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) synced with group.
150318 15:04:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150318 15:04:05 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.aR5WGz' --pid-file='/var/lib/mysql/mac848f69fbc49f.example.com-recover.pid'


in .3 we can see the node being dropped correctly:

150318 14:59:40 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid fc7641ec-cd65-11e4-b054-37790c49e075 is blacklisted, skipping
150318 14:59:40 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://192.168.0.2:4567
150318 14:59:40 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid fc7641ec-cd65-11e4-b054-37790c49e075 is blacklisted, skipping
150318 14:59:41 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') reconnecting to 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567), attempt 0
150318 14:59:42 [Note] WSREP: evs::proto(fc7641ec-cd65-11e4-b054-37790c49e075, OPERATIONAL, view_id(REG,2db36c8e-cd63-11e4-aec4-369b15ae4055,12)) suspecting node: 2db36c8e-cd63-11e4-aec4-369b15ae4055
150318 14:59:42 [Note] WSREP: evs::proto(fc7641ec-cd65-11e4-b054-37790c49e075, GATHER, view_id(REG,2db36c8e-cd63-11e4-aec4-369b15ae4055,12)) suspecting node: 2db36c8e-cd63-11e4-aec4-369b15ae4055
150318 14:59:43 [Note] WSREP: evs::proto(fc7641ec-cd65-11e4-b054-37790c49e075, GATHER, view_id(REG,2db36c8e-cd63-11e4-aec4-369b15ae4055,12)) suspecting node: 2db36c8e-cd63-11e4-aec4-369b15ae4055
150318 14:59:43 [Note] WSREP: declaring 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 stable
150318 14:59:43 [Note] WSREP: Node 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 state prim
150318 14:59:43 [Note] WSREP: view(view_id(PRIM,9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,13) memb {
        9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,0
        fc7641ec-cd65-11e4-b054-37790c49e075,0
} joined {
} left {
} partitioned {
        2db36c8e-cd63-11e4-aec4-369b15ae4055,0
})
150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)
150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567
150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
150318 14:59:43 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid fc7641ec-cd65-11e4-b054-37790c49e075 is blacklisted, skipping
150318 14:59:43 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
150318 14:59:43 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') turning message relay requesting off
150318 14:59:43 [Note] WSREP: STATE EXCHANGE: sent state msg: a55168ce-cd6e-11e4-aaed-a7966a228361
150318 14:59:43 [Note] WSREP: STATE EXCHANGE: got state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 from 0 (mac848f69fbc643.example.com)
150318 14:59:43 [Note] WSREP: STATE EXCHANGE: got state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 from 1 (mac848f69fbc4c3.example.com)
150318 14:59:43 [Note] WSREP: Quorum results:
        version    = 3,
        component  = PRIMARY,
        conf_id    = 12,
        members    = 2/2 (joined/total),
        act_id     = 2637851,
        last_appl. = 2637726,
        protocols  = 0/5/3 (gcs/repl/appl),
        group UUID = 68d12982-c254-11e4-a0e8-a6fbd098225f

similar on .4:

150318 14:59:43 [Note] WSREP: view(view_id(PRIM,9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,13) memb {
        9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,0
        fc7641ec-cd65-11e4-b054-37790c49e075,0
} joined {
} left {
} partitioned {
        2db36c8e-cd63-11e4-aec4-369b15ae4055,0
})
150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)
150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567
150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
150318 14:59:43 [Note] WSREP: (9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.4:4567' pointing to uuid 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 is blacklisted, skipping
150318 14:59:43 [Note] WSREP: (9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3, 'ssl://0.0.0.0:4567') turning message relay requesting off
150318 14:59:43 [Note] WSREP: STATE_EXCHANGE: sent state UUID: a55168ce-cd6e-11e4-aaed-a7966a228361
150318 14:59:43 [Note] WSREP: STATE EXCHANGE: sent state msg: a55168ce-cd6e-11e4-aaed-a7966a228361
150318 14:59:43 [Note] WSREP: STATE EXCHANGE: got state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 from 0 (mac848f69fbc643.example.com)
150318 14:59:43 [Note] WSREP: STATE EXCHANGE: got state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 from 1 (mac848f69fbc4c3.example.com)
150318 14:59:43 [Note] WSREP: Quorum results:
        version    = 3,
        component  = PRIMARY,
        conf_id    = 12,
        members    = 2/2 (joined/total),
        act_id     = 2637851,
        last_appl. = 2637726,
        protocols  = 0/5/3 (gcs/repl/appl),
        group UUID = 68d12982-c254-11e4-a0e8-a6fbd098225f



Subsequent startups on .3 and .4 fail to set up SSL correctly; we get "SSL error: SSL_CTX_set_default_verify_paths failed" and the database starts but the cluster does not:

150318 15:42:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150318 15:42:11 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.cS5mmd' --pid-file='/var/lib/mysql/mac848f69fbc643.example.com-recover.pid'
150318 15:42:11 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
150318 15:42:11 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
150318 15:42:11 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)
150318 15:42:13 mysqld_safe WSREP: Recovered position 68d12982-c254-11e4-a0e8-a6fbd098225f:2642150
150318 15:42:13 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
150318 15:42:13 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
150318 15:42:13 [Note] WSREP: wsrep_start_position var submitted: '68d12982-c254-11e4-a0e8-a6fbd098225f:2642150'
150318 15:42:13 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)
150318 15:42:13 [Note] WSREP: Read nil XID from storage engines, skipping position init
150318 15:42:13 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
150318 15:42:13 [Note] WSREP: wsrep_load(): Galera 3.5(rXXXX) by Codership Oy <info> loaded successfully.
150318 15:42:13 [Note] WSREP: CRC-32C: using hardware acceleration.
150318 15:42:13 [Note] WSREP: Found saved state: 68d12982-c254-11e4-a0e8-a6fbd098225f:2642150
150318 15:42:13 [Note] WSREP: Passing config to GCS: base_host = 192.168.0.4; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.weight = 1; protonet
150318 15:42:13 [Note] WSREP: Service thread queue flushed.
150318 15:42:13 [Note] WSREP: Assign initial position for certification: 2642150, protocol version: -1
150318 15:42:13 InnoDB: The InnoDB memory heap is disabled
150318 15:42:13 InnoDB: Mutexes and rw_locks use GCC atomic builtins
150318 15:42:13 InnoDB: Compressed tables use zlib 1.2.7
150318 15:42:13 InnoDB: Using Linux native AIO
150318 15:42:13 InnoDB: Initializing buffer pool, size = 128.0M
150318 15:42:13 InnoDB: Completed initialization of buffer pool
150318 15:42:13 InnoDB: highest supported file format is Barracuda.
150318 15:42:13  InnoDB: Waiting for the background threads to start
150318 15:42:14 Percona XtraDB (http://www.percona.com) 5.5.40-MariaDB-36.1 started; log sequence number 2019224366
150318 15:42:14 [Note] Plugin 'FEEDBACK' is disabled.
150318 15:42:14 [Warning] Failed to setup SSL
150318 15:42:14 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed
150318 15:42:14 [Note] Server socket created on IP: '192.168.0.4'.
150318 15:42:14 [Note] Event Scheduler: Loaded 0 events
150318 15:42:14 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.40-MariaDB-wsrep'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_25.11.r4026
150318 15:59:44 [Note] /usr/libexec/mysqld: Normal shutdown
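
(As an aside on the SSL_CTX_set_default_verify_paths warning above, a hedged way to inspect the SSL material the server is trying to load. The option names are the standard [mysqld]/wsrep SSL settings; the paths below are placeholders, not taken from this setup.)

grep -E 'ssl|socket.ssl' /etc/my.cnf /etc/my.cnf.d/*.cnf                   # where ssl-ca / ssl-cert / ssl-key and wsrep socket.ssl_* point
openssl verify -CAfile /path/to/ca.pem /path/to/server-cert.pem            # placeholder paths: does the certificate chain actually verify?
ls -lZ /path/to/ca.pem /path/to/server-cert.pem /path/to/server-key.pem    # placeholder paths: readable by the mysql user, sane SELinux context?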


on .2, we have it coming up with the broken XA state:

150318 15:04:05  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Doing recovery: scanned up to log sequence number 2013915322
InnoDB: Transaction 50BBE2 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 50BD00
150318 15:04:06  InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 
InnoDB: Apply batch completed
InnoDB: Starting in background the rollback of uncommitted transactions
150318 15:04:07  InnoDB: Rollback of non-prepared transactions completed
150318 15:04:07  InnoDB: Waiting for the background threads to start
150318 15:04:08 Percona XtraDB (http://www.percona.com) 5.5.40-MariaDB-36.1 started; log sequence number 2013915322
150318 15:04:08 [Note] Plugin 'FEEDBACK' is disabled.
150318 15:04:08  InnoDB: Starting recovery for XA transactions...
150318 15:04:08  InnoDB: Transaction 50BBE2 in prepared state after recovery
150318 15:04:08  InnoDB: Transaction contains changes to 1 rows
150318 15:04:08  InnoDB: 1 transactions in prepared state after recovery
150318 15:04:08 [Note] Found 1 prepared transaction(s) in InnoDB
150318 15:04:08 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
150318 15:04:08 [ERROR] Aborting

150318 15:04:08  InnoDB: Starting shutdown...
150318 15:04:10  InnoDB: Shutdown completed; log sequence number 2013919992
150318 15:04:10 [Note] /usr/libexec/mysqld: Shutdown complete'
(END)

Comment 35 Michael Bayer 2015-03-18 18:41:53 UTC
okey doke, as for why we actually see things related to XA: it seems very likely that my #2 theory is right, i.e. InnoDB uses XA internally. This blog post seems to suggest as much: http://harrison-fisk.blogspot.com/2009/01/xa-and-persistent-innodb-locks.html

So we may have to go forward assuming that in very rare cases this XA issue can occur on very bad crashes, and we may have to produce a workaround. However, I don't know that the rest of the cluster should be affected by this. Asking Ryan about the SSL issue.

Comment 36 Michael Bayer 2015-03-18 18:53:33 UTC
Here's the ps output for the two mariadb nodes that are running. Note there are no gcomm:// addresses here:

root      64793 101928  0 20:49 pts/0    00:00:00 grep --color=auto mysql
root     110421      1  0 16:29 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --read-only=true
mysql    111553 110421  0 16:29 ?        00:00:44 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --read-only=true --log-error=/var/log/mysqld.log --open-files-limit=-1 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=68d12982-c254-11e4-a0e8-a6fbd098225f:2642150

So... is pacemaker / the resource agent getting confused on startup by node #1 not starting (or by something else) and not working correctly? Or perhaps this is unrelated to node #1 not starting at all?
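
(For comparison, a node that the resource agent has promoted to master would show a wsrep cluster address in its command line - roughly like the line below, which is illustrative only, with the addresses assumed from this setup:)

/usr/libexec/mysqld --defaults-file=/etc/my.cnf ... --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm://192.168.0.2,192.168.0.3,192.168.0.4 ...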

Comment 37 Michael Bayer 2015-03-18 19:37:06 UTC
okey doke -

Just got done with Ryan.     What we're seeing here at the moment does not look like anything very unexpected or unreasonable at the Galera level, at least.  The failed state of the first node did *not* bring the galera cluster down at all - as far as what the logs show, the cluster stayed running just fine:

node 1 goes down, sometime after 14:54, and we see it coming up again at 15:04 and then failing to start:

150318 14:54:12 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) synced with group.
150318 15:04:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

150318 15:04:08 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
150318 15:04:08 [ERROR] Aborting


nodes 2 and 3 detect the outage at 14:54, rebuild a new cluster between the two galera nodes, keep running fine:

150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)
150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567
150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
150318 14:59:43 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid fc7641ec-cd65-11e4-b054-37790c49e075 is blacklisted, skipping


150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)
150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567
150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
150318 14:59:43 [Note] WSREP: (9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.4:4567' pointing to uuid 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 is blacklisted, skipping

So while node 1 was down from 14:54 on and never came up again, node 2 and 3 recovered without issue.  Cluster stayed up from a Galera perspective.   Node 1 was entirely unreachable so as far as HAProxy knowing how to adjust, it should not have been anything special, that is, the exotic failed state of node 1 shouldn't have affected anything - it was just an unreachable server.

However, what happens next is 40 minutes later at around 15:41, both nodes 2 *and* node 3 are bounced at approximately the same time; node 2 is restarted, node 3 doesn't seem to come back until 16:29.  When both of these come back, there's no more cluster.

At that point, things are also not too surprising.    First off, when all nodes in the galera cluster are off, the cluster is gone and it would need to do bootstrap again to make a new galera cluster.  In this case, per Ryan, Pacemaker/resource agent isn't able to do a whole new bootstrap process from scratch when one of the involved nodes can't come up at all, and we think this might be normal (adding dvossel to confirm).   The only ways to get this up again are 1. restore node 1 before attempting to restart the entire cluster or 2. remove node 1 from the pacemaker cluster entirely before trying to restart.   A customer would likely do #1.   I don't know that people expect Pacemaker to build a complete cluster from scratch when one of the nodes is not able to start.

As far as the frozen state of node #1, it needs to be resolved manually; resetting grastate.dat *may* work, if it causes galera to do a full SST before it gets to the InnoDB part of it, as the corrupted InnoDB files would be replaced by rsync.
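
A hedged sketch of that grastate.dat route on the stuck node (the resource and node names are assumptions; take a copy of the datadir first, and only attempt this while the other two nodes hold a Primary component):

pcs resource ban galera-master pcmk-mac848f69fbc49f    # keep the agent from retrying while we work
cp -a /var/lib/mysql /var/lib/mysql.bak.$(date +%F)    # safety copy of the datadir
rm -f /var/lib/mysql/grastate.dat                      # with no saved state the node should request a full SST when it rejoins
pcs resource clear galera-master pcmk-mac848f69fbc49f
pcs resource cleanup galera                            # let pacemaker try the start again; the SST would overwrite the InnoDB files

As noted above this *may* work; if the position-recovery step still trips over the InnoDB files before any SST happens, manual --tc-heuristic-recover is the fallback.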

Comment 38 David Vossel 2015-03-18 20:21:31 UTC
(In reply to Michael Bayer from comment #37)
> okey doke -
> 
> Just got done with Ryan.     What we're seeing here at the moment does not
> look like anything very unexpected or unreasonable at the Galera level, at
> least.  The failed state of the first node did *not* bring the galera
> cluster down at all - as far as what the logs show, the cluster stayed
> running just fine:
> 
> node 1 goes down, sometime after 14:54, and we see it coming up again at
> 15:04 and then failing to start:
> 
> 150318 14:54:12 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com)
> synced with group.
> 150318 15:04:05 mysqld_safe Starting mysqld daemon with databases from
> /var/lib/mysql
> 
> 150318 15:04:08 [ERROR] Found 1 prepared transactions! It means that mysqld
> was not shut down properly last time and critical recovery information (last
> binlog or tc.log file) was manually deleted after a crash. You have to start
> mysqld with --tc-heuristic-recover switch to commit or rollback pending
> transactions.
> 150318 15:04:08 [ERROR] Aborting
> 
> 
> nodes 2 and 3 detect the outage at 14:54, rebuild a new cluster between the
> two galera nodes, keep running fine:
> 
> 150318 14:59:43 [Note] WSREP: forgetting
> 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)
> 150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567
> 150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no,
> my_idx = 1, memb_num = 2
> 150318 14:59:43 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075,
> 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid
> fc7641ec-cd65-11e4-b054-37790c49e075 is blacklisted, skipping
> 
> 
> 150318 14:59:43 [Note] WSREP: forgetting
> 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)
> 150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567
> 150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no,
> my_idx = 0, memb_num = 2
> 150318 14:59:43 [Note] WSREP: (9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,
> 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.4:4567' pointing to uuid
> 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 is blacklisted, skipping
> 
> So while node 1 was down from 14:54 on and never came up again, node 2 and 3
> recovered without issue.  Cluster stayed up from a Galera perspective.  
> Node 1 was entirely unreachable so as far as HAProxy knowing how to adjust,
> it should not have been anything special, that is, the exotic failed state
> of node 1 shouldn't have affected anything - it was just an unreachable
> server.
> 
> However, what happens next is 40 minutes later at around 15:41, both nodes 2
> *and* node 3 are bounced at approximately the same time; node 2 is
> restarted, node 3 doesn't seem to come back until 16:29.  When both of these
> come back, there's no more cluster.
> 
> At that point, things are also not too surprising.    First off, when all
> nodes in the galera cluster are off, the cluster is gone and it would need
> to do bootstrap again to make a new galera cluster.  In this case, per Ryan,
> Pacemaker/resource agent isn't able to do a whole new bootstrap process from
> scratch when one of the involved nodes can't come up at all, and we think
> this might be normal (adding dvossel to confirm).   The only ways to get
> this up again are 1. restore node 1 before attempting to restart the entire
> cluster or 2. remove node 1 from the pacemaker cluster entirely before
> trying to restart.   A customer would likely do #1.   I don't know that
> people expect Pacemaker to build a complete cluster from scratch when one of
> the nodes is not able to start.

In the event of a complete failure, the galera resource-agent can only automate the bootstrap process if all the galera instances have reported in their last known sequence number.  Right now, in order to report the last known sequence number each node must be able to start galera in read-only mode.  If a node is incapable of starting galera in read only mode, the cluster is expected to block waiting for manual intervention.
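
(The "last known sequence number" each node reports can be inspected by hand; these are standard galera mechanisms, not specific to this deployment:)

cat /var/lib/mysql/grastate.dat    # uuid/seqno saved at the last clean shutdown; seqno is -1 after a crash
mysqld_safe --wsrep-recover        # replays the InnoDB log, logs 'Recovered position <uuid>:<seqno>', then exits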


> 
> As far as the frozen state of node #1, it needs to be resolved manually;
> resetting grastate.dat *may* work, if it causes galera to do a full SST
> before it gets to the InnoDB part of it, as the corrupted InnoDB files would
> be replaced by rsync.

Comment 39 Michael Bayer 2015-03-18 22:20:29 UTC
Hi Asaf -

Everything we see so far indicates the first node went down at 14:59, failed to come back, and the cluster proceeded normally until 15:41.    We need to get information on what exactly happened at 15:41, was the system manually rebooted or restarted in some way?  The "last -x" logs on 192.168.0.3 show this:

root     pts/0        192.168.0.1      Wed Mar 18 15:45 - 15:47  (00:02)    
runlevel (to lvl 3)   3.10.0-229.el7.x Wed Mar 18 15:44 - 00:09  (08:25)    
reboot   system boot  3.10.0-229.el7.x Wed Mar 18 15:43 - 00:09  (08:25)    
shutdown system down  3.10.0-229.el7.x Wed Mar 18 15:41 - 15:43  (00:01)    
root     pts/0        192.168.0.1      Wed Mar 18 15:41 - down   (00:00)    
root     pts/1        192.168.0.1      Wed Mar 18 14:10 - down   (01:30)    
root     pts/1        192.168.0.1      Wed Mar 18 14:07 - 14:08  (00:01) 


and in the mysql log we see MySQL shutting off:

150318 15:41:43 [Note] /usr/libexec/mysqld: Normal shutdown

we then do not see MySQL on .03 coming up again until 16:29:

150318 16:29:53 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150318 16:29:53 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.ZZFsn6' --pid-file='/var/lib/mysql/mac848f69fbc4c3.example.com-recover.pid'
150318 16:29:53 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
150318 16:29:53 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
150318 16:29:53 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)


on .04, we don't see any reboot occurring but we see MySQL being stopped at 15:42:

150318 15:42:06 [Note] /usr/libexec/mysqld: Normal shutdown
150318 15:42:06 [Note] WSREP: Stop replication
150318 15:42:06 [Note] WSREP: Closing send monitor...

and it comes back immediately:

150318 15:42:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150318 15:42:11 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.cS5mmd' --pid-file='/var/lib/mysql/mac848f69fbc643.example.com-recover.pid'

and another stop and start between 15:59 and 16:14:

150318 15:59:45 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
150318 16:14:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql


Each of these restarts/reboots occurs 40 minutes after anything regarding the first node has happened. Can we please get complete detail on all steps taken to initiate these sequences? We'd like to know whether these restarts occurred without intervention or whether this is something you were doing on your end.

Comment 40 Michael Bayer 2015-03-18 22:36:43 UTC
I've confirmed that the "--tc-heuristic-recover=rollback" option is sufficient to recover the single node from this case.  I copied the /var/lib/mysql files over to a local VM here and reproduced the same error on startup:

[root@rhel7-1 lib]# mysqld_safe 
150318 18:27:56 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
150318 18:27:56 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150318 18:27:56 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.qGINfH' --pid-file='/var/lib/mysql/rhel7-1-recover.pid'
150318 18:27:59 mysqld_safe WSREP: Failed to recover position: 
...


150318 18:27:58  InnoDB: Transaction contains changes to 1 rows
150318 18:27:58  InnoDB: 1 transactions in prepared state after recovery
150318 18:27:58 [Note] Found 1 prepared transaction(s) in InnoDB
150318 18:27:58 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
150318 18:27:58 [ERROR] Aborting


then used the option and it was able to recover past this:

[root@rhel7-1 lib]# mysqld_safe  --tc-heuristic-recover=rollback
150318 18:28:41 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
150318 18:28:41 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150318 18:28:41 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.gUbSQo' --pid-file='/var/lib/mysql/rhel7-1-recover.pid'
150318 18:28:43 mysqld_safe WSREP: Recovered position 68d12982-c254-11e4-a0e8-a6fbd098225f:2637841

the server acted more or less normally after that.

Comment 41 Asaf Hirshberg 2015-03-19 09:07:27 UTC
All the restarts were done manually by me. Each reboot was during a Rally task run from another server on that setup. I got into this state after I used a constant load of creating instances, and not by using rps (runs per second) as I had used from the beginning of the test.

Comment 42 Perry Myers 2015-03-19 11:11:45 UTC
(In reply to Asaf Hirshberg from comment #41)
> All the restarts were done manually by me. Each reboot was during a Rally
> task run from another server on that setup. I got into this state after I
> used a constant load of creating instances, and not by using rps (runs per
> second) as I had used from the beginning of the test.

Asaf,

We need a complete list of all reboots initiated with the times for each reboot and purpose of the reboot. For example:

14:50: reboot of 192.168.0.2 via ssh session as root from Rally
15:41: reboot of 192.168.0.3 via ssh session as root from Asaf

Something along these lines. Please get us this info, since right now we're struggling with understanding why these nodes are rebooting.

The level of detail you're providing (in general) in this bug report isn't enough to really help us puzzle out the sequence of events here. Comments like "all the restarts were done manually by me" are too vague, since (as noted above) we need to know the exact details of each restart (when it was done and why). Please keep this in mind when assisting us in determining root cause for complex bugs like this.


If the 0.3 and 0.4 controller nodes indeed _did_ reboot at approx 15:41, while the 0.2 node was still in an unrecovered state (stuck pending a transaction), then effectively at 15:41 you've taken down all 3 nodes of a 3 node Galera cluster, rendering the cluster inoperable.

As dvossel clarified in Comment #38, if you take down all 3 nodes cleanly, the Pacemaker Galera Resource Agent can bootstrap the cluster, but if you take down all 3 nodes and one of the nodes cannot start mariadb, then it is not possible to bootstrap the cluster from scratch. Most automated HA systems can handle one or two concurrent errors and still maintain availability. But in the presence of multiple errors (rebooting all 3 nodes + a transaction stuck on one node), we just do not have a way to handle that. It's an expected failure scenario that would result in needing manual intervention to recover.

What we still need to figure out then is:

* When all the reboots were and why
* What caused the XA transaction to be stuck on 0.2 node during the first reboot 
  at approx 14:50

Comment 43 Perry Myers 2015-03-19 14:03:54 UTC
Ok, I was chatting with oblaut and he cleared up some confusion for me.

Rally itself does not initiate any reboots as part of its testing. When Asaf says above "Each reboot was during a Rally task run from another server on that setup", what he means is that while Rally was running, he manually SSH'd into the respective nodes and issued a reboot command.

So, I think we now have one item left to focus on. Namely: why did the initial reboot of 0.2 node cause a transaction to get stuck in mariadb, which prevented the Galera cluster from healing back to a full 3 node cluster.

This is the question that mbayer/dvossel/abeekhof will focus on answering over the next few days, with help from Asaf.

Comment 44 Michael Bayer 2015-03-19 14:30:58 UTC
Hey Perry - 

The issue with the 0.2 node, I believe, is something that can just happen on rare occasions, as described in this blog post: http://harrison-fisk.blogspot.com/2009/01/xa-and-persistent-innodb-locks.html.

I do have the mysql datafiles that cause this corruption captured, so we can reproduce the condition in that way (actually let me attach them).

Comment 45 Asaf Hirshberg 2015-03-19 14:31:24 UTC
OK Perry, understood.

I started from scratch, installed a new setup and started testing. Here is what I did on the setup (FYI, the controllers' IPs have changed). Each reboot was done by me using the reboot command.

ssh root.0.3
./neutron-install tenant # script that creates some tenants & routers & networks
--------------
ssh root@rally
rall-v task start constant.json
{
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "sleep": 5,
                "flavor": {
                    "name": "nano"
                },
                "image": {
                    "name": "cirros"
                },
                "force_delete": false
            },
            "runner": {
                "type": "constant",
                "times": 1000,
                "concurrency": 40
            },
            "context": {
                "users": {
                    "tenants": 1,
                    "users_per_tenant": 1
                },
                "quotas": {
                    "nova": {
                        "instances": -1,
                        "cores": -1,
                        "ram": -1
                    },
                    "neutron": {
                        "port": -1,
                        "network": -1,
                        "router": -1,
                        "floatingip": -1,
                        "security_group": -1
                    }
                }

            }
            }
       ]
}
--------------

ssh root.0.3
reboot
#after server up
[root@mac848f69fbc4c3 ~]#pcs status
Master/Slave Set: galera-master [galera]
     Masters: [ pcmk-mac848f69fbc643 pcmk-macf04da2732fb1 ]
     Stopped: [ pcmk-mac848f69fbc4c3 ]

[root@mac848f69fbc4c3 ~]# mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[root@mac848f69fbc4c3 ~]# date
Thu Mar 19 14:08:13 IST 2015

[root@mac848f69fbc4c3 ~]# pcs status
Cluster name: openstack
Master/Slave Set: galera-master [galera]
     Masters: [ pcmk-mac848f69fbc4c3 pcmk-mac848f69fbc643 pcmk-macf04da2732fb1 ]
---------------
ssh root@rally
rall-v task start constant.json
--------------
[root@mac848f69fbc4c3 ~]# date;reboot
Thu Mar 19 14:22:13 IST 2015
[root@mac848f69fbc4c3 ~]# date;mysql
Thu Mar 19 14:28:12 IST 2015
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[root@mac848f69fbc4c3 ~]# pcs status
Cluster name: openstack
 Master/Slave Set: galera-master [galera]
     Masters: [ pcmk-mac848f69fbc643 pcmk-macf04da2732fb1 ]
     Stopped: [ pcmk-mac848f69fbc4c3 ]
[root@mac848f69fbc4c3 ~]# clustercheck
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36

Galera cluster node is not synced.

---------------
ssh root@rally
rall-v task start constant.json
--------------
[root@staypuft ~]# ssh root.0.4
root.0.4's password: 
Last login: Thu Mar 19 14:42:56 2015 from 192.168.0.1
[root@mac848f69fbc643 ~]# reboot

#the server stayed down after the reboot and didn't start, had to start using the kvm

[root@mac848f69fbc643 ~]# date;mysql
Thu Mar 19 15:09:23 IST 2015
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111)
[root@mac848f69fbc643 ~]# 

 Master/Slave Set: galera-master [galera]
     Masters: [ pcmk-mac848f69fbc4c3 pcmk-macf04da2732fb1 ]
     Stopped: [ pcmk-mac848f69fbc643 ]




[root@macf04da2732fb1 ~]# date
Thu Mar 19 15:24:38 IST 2015
[root@macf04da2732fb1 ~]# 
[root@macf04da2732fb1 ~]# reboot
Connection to 192.168.0.5 closed by remote host.
#the server stayed down after the reboot and didn't start, had to start using the kvm

---------------
ssh root@rally
rall-v task start constant.json
--------------
[root@mac848f69fbc643 ~]# reboot
Connection to 192.168.0.4 closed by remote host.
Connection to 192.168.0.4 closed.
[root@staypuft ~]# ssh root.0.4
root.0.4's password: 
Last login: Thu Mar 19 15:40:05 2015 from 192.168.0.1
[root@mac848f69fbc643 ~]# date;mysql
Thu Mar 19 15:45:23 IST 2015
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[root@mac848f69fbc643 ~]# 
---------------
ssh root@rally
rall-v task start constant.json
--------------
[root@staypuft ~]# ssh root.0.3
root.0.3's password: 
Last login: Thu Mar 19 14:28:04 2015 from 192.168.0.1
[root@mac848f69fbc4c3 ~]# reboot
Connection to 192.168.0.3 closed by remote host.
Connection to 192.168.0.3 closed.
[root@staypuft ~]# ssh root.0.3
root.0.3's password: 
Last login: Thu Mar 19 16:05:03 2015 from 192.168.0.1
[root@mac848f69fbc4c3 ~]# mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111)
 ip-galera-pub-10.35.173.156	(ocf::heartbeat:IPaddr2):	Stopped 
 Master/Slave Set: galera-master [galera]
     galera	(ocf::heartbeat:galera):	FAILED pcmk-macf04da2732fb1 
     Stopped: [ pcmk-mac848f69fbc4c3 pcmk-mac848f69fbc643 ]

[root@mac848f69fbc4c3 ~]# date
Thu Mar 19 16:17:15 IST 2015

[root@mac848f69fbc643 ~]# mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[root@mac848f69fbc643 ~]# exit
logout
Connection to 192.168.0.4 closed.
[root@staypuft ~]# ssh root.0.5
root.0.5's password: 
Last login: Thu Mar 19 16:21:15 2015 from 192.168.0.1
[root@macf04da2732fb1 ~]# mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[root@macf04da2732fb1 ~]# exit
logout
Connection to 192.168.0.5 closed.
[root@staypuft ~]# ssh root.0.4
root.0.4's password: 
Last login: Thu Mar 19 16:25:27 2015 from 192.168.0.1
[root@mac848f69fbc643 ~]# mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[root@mac848f69fbc643 ~]# exit
logout
Connection to 192.168.0.4 closed.
[root@staypuft ~]# ssh root.0.3
root.0.3's password: 
Last login: Thu Mar 19 16:10:39 2015 from 192.168.0.1
[root@mac848f69fbc4c3 ~]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 28
Server version: 5.5.40-MariaDB-wsrep

Copyright (c) 2000, 2014, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]>

Comment 47 Michael Bayer 2015-03-19 14:39:43 UTC
OK, the attachment is too large to post.

That said, the blog post above refers to an issue that was fixed in MySQL 5.5.0, years ago. Googling for this issue otherwise turns up mostly Percona-related hits. So this could be something that these days is specific to Galera, and perhaps a fixable bug.

Comment 48 Michael Bayer 2015-03-19 15:36:08 UTC
the above steps in comment 45 are still hard to follow, as there are missing steps (there's an ssh to root.0.4, and then a few lines down we suddenly are on another host, "[root@macf04da2732fb1 ~]# reboot
Connection to 192.168.0.5 closed by remote host.") and scant timestamps as to what is happening when.

As far as I can tell, the reboot at 14:28 was normal; the "pcs status" call shows that pcmk-mac848f69fbc4c3 didn't yet have time to join the cluster. Then the rally task is started (while pcmk-mac848f69fbc4c3 is still joining? or was this 10 minutes later?). Then we see the reboot of mac848f69fbc643, and then we see "the server stayed down after the reboot and didn't start, had to start using the kvm". What's *that* about? The whole OS is failing? From that point on, the mariadb on that node is in the failed state that we still don't understand.

for the moment, I'm going to keep peering into the "failed" log files I have, to try to find anything local to Galera/MariaDB that would explain that individual node's state.

Comment 49 Michael Bayer 2015-03-19 17:07:25 UTC
for the last log, AFAICT the MySQL instance here was either shutdown by pacemaker or somehow crashed.

on macf04da2732fb1, reboot is called.   on mac848f69fbc643, pacemaker logs seem to indicate this being seen:

	Mar 19 15:24:42 [3365] mac848f69fbc643.example.com      attrd:     info: attrd_peer_update:     Setting shutdown[pcmk-macf04da2732fb1]: 0 -> 1426771482 from pcmk-macf04da2732fb1

	Mar 19 15:24:43 [3367] mac848f69fbc643.example.com       crmd:     info: action_synced_wait:    Managed galera_meta-data_0 process 30441 exited with rc=1


then on the mac848f69fbc643 machine, galera picks up that the node is down normally:

	150319 15:24:49 [Note] WSREP: forgetting b112d756-ce22-11e4-ae6e-86b089817ae9 (ssl://192.168.0.5:4567)
	150319 15:24:49 [Note] WSREP: deleting entry ssl://192.168.0.5:4567
	150319 15:24:49 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
	...

	150319 15:24:54 [Note] WSREP:  cleaning up b112d756-ce22-11e4-ae6e-86b089817ae9 (ssl://192.168.0.5:4567)

took out the .5 node normally.   


Then nothing for four minutes, until.. ???

        *** need to know what happened here, the mac848f69fbc643 was not rebooted but mariadb seems to have stopped ***

Now mysqld restarts, and on startup we're into the bad XA state:

	150319 15:28:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
	150319 15:28:17 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.5p19Qf' --pid-file='/var/lib/mysql/mac848f69fbc643.example.com-recover.pid'
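
For reference, the "position recovery" step the agent runs there can also be invoked by hand to see what it reports; a minimal sketch, with an illustrative log path (run on the affected node while mysqld is stopped):

    # recover the last committed write-set position from the InnoDB files
    # without starting the full server (roughly what the agent does here)
    mysqld_safe --wsrep-recover --log-error=/tmp/wsrep_recovery.log
    grep 'Recovered position' /tmp/wsrep_recovery.log
    # healthy output looks like:  WSREP: Recovered position: <cluster-uuid>:<seqno>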


there are a lot of pacemaker logs on mac848f69fbc643 during the 15:24 period, but I don't know if they indicate Galera being affected.   Can dvossel check the logs on mac848f69fbc643 to determine this?

Comment 50 David Vossel 2015-03-19 18:34:48 UTC
(In reply to Michael Bayer from comment #49)
> for the last log, AFAICT the MySQL instance here was either shutdown by
> pacemaker or somehow crashed.

I'm fairly certain mysqld crashed; explanation below.

> on macf04da2732fb1, reboot is called.   on mac848f69fbc643, pacemaker logs
> seem to indicate this being seen:
> 
> 	Mar 19 15:24:42 [3365] mac848f69fbc643.example.com      attrd:     info:
> attrd_peer_update:     Setting shutdown[pcmk
> 		-macf04da2732fb1]: 0 -> 1426771482 from pcmk-macf04da2732fb1
> 
> 	Mar 19 15:24:43 [3367] mac848f69fbc643.example.com       crmd:     info:
> action_synced_wait:    Managed galera_meta-data_0 process 30441 exited with
> rc=1

Don't worry about this meta-data action. I don't know why it is failing but it has nothing to do with the problem at hand.

> 
> then on the mac848f69fbc643 machine, galera picks up that the node is down
> normally:
> 
> 	150319 15:24:49 [Note] WSREP: forgetting
> b112d756-ce22-11e4-ae6e-86b089817ae9 (ssl://192.168.0.5:4567)
> 	150319 15:24:49 [Note] WSREP: deleting entry ssl://192.168.0.5:4567
> 	150319 15:24:49 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no,
> my_idx = 0, memb_num = 2
> 	...
> 
> 	150319 15:24:54 [Note] WSREP:  cleaning up
> b112d756-ce22-11e4-ae6e-86b089817ae9 (ssl://192.168.0.5:4567)
> 
> took out the .5 node normally.   
> 
> 
> Then nothing for four minutes, until.. ???

right. this gap is where mysqld goes silent.


=== Sequence of events. ===

15:23:10 START ENDS - successfully started galera in read-only. 

15:23:12 PROMOTE BEGINS
    15:23:13 shutdown of read only succeeds
    15:23:19 initiate start of galera in cluster mode
    15:23:26 INFO: Waiting for database to sync with the cluster.

^^^  here the agent is waiting for galera to successfully start and sync up with the galera cluster.

    15:24:54 mysql logs go silent.

^^^ that's got to be the mysqld crash

15:28:12 - PROMOTE ENDS - Timed Out waiting for mysql to sync with the cluster... never happened because mysqld disappeared.


15:28:21 - PROMOTE FAILS again
^^^ someone had to have manually done a promote using pcs and that debug mode or something.

15:42:58 - MONITOR - reports galera isn't running.
15:43:04 - START - fails  to start because of corrupted recovery information.

Galera is hosed after the crash at 15:24:54 on this node until manual intervention.
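
For anyone reproducing this, the state the agent polls for during that promote can be checked by hand on the node; a minimal sketch (only meaningful while mysqld is actually up, which is exactly what wasn't the case here):

    # the promote is waiting for this node to reach the "Synced" wsrep state
    mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"

    # and pacemaker's view of the galera resource
    pcs status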


-- David

Comment 51 Michael Bayer 2015-03-19 19:20:56 UTC
well then that's the next mystery: why is that same mysqld being started at 15:23?   Again we see it go silent at 14:41 and start up at 15:23:

150319 14:41:37 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) synced with group.
150319 14:41:37 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 13384)
150319 14:41:37 [Note] WSREP: Synchronized with group, ready for connections
150319 14:41:37 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150319 14:41:37 [Note] WSREP: Member 0.0 (mac848f69fbc4c3.example.com) synced with group.

* silent *

150319 15:23:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150319 15:23:02 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.XKQLh6' --pid-file='/var/lib/mysql/mac848f69fbc643.example.com-recover.pid'

Comment 52 Michael Bayer 2015-03-19 19:23:25 UTC
hi Asaf -

We're going to need very exact steps, so that I can log into a brand new environment like this and run exactly the same series of commands.  We need to be able to run each step individually and examine all the logs between steps.   Can I either use this environment, or can you set up a new one, and send us a series of exact commands (1, 2, 3, 4, etc.) to reproduce this sequence?   Thanks.

Comment 55 Michael Bayer 2015-03-24 23:02:18 UTC
Created attachment 1006060 [details]
Python script to create database load

Comment 58 Michael Bayer 2015-04-03 15:38:04 UTC
a related issue posted for mariadb-galera refers to exactly the state this node crashed into (describing it as "the innodb tablespace is damaged"); it only means that the single affected node can't automatically rejoin the cluster: https://mariadb.atlassian.net/browse/MDEV-6660
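
In that situation the usual manual fix is to discard the damaged node's local data and let it rejoin via a full state transfer. A very rough outline only, using the resource names seen in this deployment; the supported step-by-step procedure is the Red Hat solution linked in comment 62:

    # keep pacemaker from retrying galera on the damaged node
    pcs resource ban galera-master $(crm_node -n)

    # preserve the damaged datadir and re-initialize an empty one, so the node
    # can come back up and take a full SST from the surviving cluster
    mv /var/lib/mysql /var/lib/mysql.damaged
    mkdir /var/lib/mysql
    chown mysql:mysql /var/lib/mysql
    restorecon -R /var/lib/mysql
    mysql_install_db --datadir=/var/lib/mysql --user=mysql

    # let pacemaker manage the node again and clear the failure history
    pcs resource clear galera-master $(crm_node -n)
    pcs resource cleanup galera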

Comment 59 Michael Bayer 2015-04-03 17:03:37 UTC
So, after a week of trying to make this happen, it is fairly clear, at least as far as the logs I attached above show, what happened there from a galera/pacemaker perspective.   Looking over those logs again, I can play out just what happened, and I can reproduce the same thing locally; it is what we theorized in comment #38.

given 192.168.0.2 in mysqld.2.log, .3 in mysqld.3.log, and .4 in mysqld.4.log:

1. Let's first assume that the QE environment, which has been demonstrated to have some kind of issue with the "reboot" command, sometimes crashes the server when "reboot" is called rather than gracefully rebooting it.  In .2, we see that between 14:54 and 15:04 a crash and a restart occur:

150318 14:54:12 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) synced with group.

150318 15:04:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

2. Nodes .3 and .4 adjust to this crash just fine:

192.168.0.3:

150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)

192.168.0.4:

150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)

3. 192.168.0.2 crashed badly enough that the InnoDB files are corrupted, hence we see the issue described at https://mariadb.atlassian.net/browse/MDEV-6660.   By 15:04, the .2 node is done and never heard from again, since it cannot start:

150318 15:04:08 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
150318 15:04:08 [ERROR] Aborting
150318 15:04:08  InnoDB: Starting shutdown...
150318 15:04:10  InnoDB: Shutdown completed; log sequence number 2013919992
150318 15:04:10 [Note] /usr/libexec/mysqld: Shutdown complete'

4. From here on, we're down to only *two* nodes.  Yet the shutdowns of MySQL databases seem to continue; .3 is shut down at 15:41:

150318 15:41:43 [Note] /usr/libexec/mysqld: Normal shutdown

.4 adapts and realizes there's no more cluster; .3 is considered to be "partitioned":

150318 15:41:58 [Note] WSREP: view(view_id(NON_PRIM,9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,14) memb {
	9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,0
} joined {
} left {
} partitioned {
	fc7641ec-cd65-11e4-b054-37790c49e075,0
})
150318 15:41:58 [Note] WSREP: New cluster view: global state: 68d12982-c254-11e4-a0e8-a6fbd098225f:2642150, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3


5. Then at 15:42 we see node .4 being shut down as well, while .3 remains down:

150318 15:42:06 [Note] /usr/libexec/mysqld: Normal shutdown

The cluster is now gone.   There's no database turned on at all.

6. .4 comes back at 15:42, to an empty cluster:

150318 15:42:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

and starts in normal non-clustered mode - Pacemaker can't do anything here:

150318 15:42:14 [Note] Event Scheduler: Loaded 0 events
150318 15:42:14 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.40-MariaDB-wsrep'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_25.11.r4026



7. .3 is finally brought back at 16:29:

150318 16:29:53 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql


and also comes back in non-cluster mode, because, as expected, Pacemaker can't build a cluster while .2 is still down:

150318 16:29:57 [Note] Event Scheduler: Loaded 0 events
150318 16:29:57 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.40-MariaDB-wsrep'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_25.11.r4026


And that's what happened!   What isn't clear is how these MySQL databases were shut off, or whether they crashed, or what.  But there's nothing here to show that anything happened to them other than that the machines were being constantly rebooted in an environment that has known issues with clean reboots.

Googling for the "tc-heuristic-recover" issue turns up quite a few results, all of which involve MariaDB-Galera / Percona.  This appears to be a known failure mode.  However, none of these issues or blog posts link it to problems on the other nodes in the Galera cluster.

After staring at this for two weeks and getting a lot more comfortable watching galera and pacemaker restart over and over again, I'm not finding evidence for anything unusual within the scope of these tools.

Comment 60 David Vossel 2015-04-06 13:41:51 UTC
okay, in short:

1. galera crashes on one node (node1)
2. node2 and node3 gracefully reboot
3. the galera cluster can't start again on node2 and node3 because node1's data is corrupted from the crash. (When all galera instances are down, pacemaker has to see every galera instance come up in read-only mode before it lets the galera cluster re-form.)

In this condition, we expect pacemaker to block starting of the galera cluster until manual intervention takes place to figure out what is going on with the data corruption. If we determine there is a safe way to automate recovery here, I can add that procedure to the agent. We've known that galera theoretically had the potential to block automated recovery, but now that I'm seeing this blocking behavior reliably reproduced, it's kind of troubling.
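
For the manual-intervention case where every galera instance is down, the operator's job is essentially to find the node with the most advanced commit position before letting the cluster re-form. A minimal sketch of that check, run on each controller; if seqno shows -1 after a crash, the position-recovery run shown earlier in this bug gives the real value:

    # compare across all three controllers; the highest seqno holds the most
    # recent data and is the node the cluster should be formed from
    cat /var/lib/mysql/grastate.dat
    #   uuid:    68d12982-c254-11e4-a0e8-a6fbd098225f
    #   seqno:   2642150      (-1 here means the node crashed; recover the position first)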

-- David

Comment 61 Michael Bayer 2015-04-08 22:59:08 UTC
So we're not able to reproduce this in any way other than what would normally be expected if one of the Galera nodes crashes, and the logfiles we've observed all seem to indicate that the cluster stays running fine when the first node is intentionally crashed out.  The other nodes come back without issue, and only if the remaining two are stopped at the same time does Pacemaker refuse to join them back into a cluster.   No logs indicate the other galera nodes having any kind of problem when the initial node goes down; this hasn't been reproduced in isolation after much trying, nor are there reports in the wild of anything similar.

There is some discussion about the "server won't reboot" issue on other channels but that should be opened as a separate issue.

Comment 62 Ofer Blaut 2015-04-29 05:52:16 UTC
A workaround to recover from the mariadb failure is suggested in https://access.redhat.com/solutions/1283463

Comment 63 Asaf Hirshberg 2015-05-10 12:34:31 UTC
I tried to reproduce it again; I used a 3-controller, 8-compute setup (physical).
I ran the Rally "boot-run-command" scenario and rebooted the same controller over and over (more than 10 times), and every time galera and the rest of the cluster managed to recover without any problem.

Comment 64 Salman Khan 2015-06-09 11:51:36 UTC
The workaround link https://access.redhat.com/solutions/1283463 isn't working.
I am hitting this issue again and again. Can someone provide a working workaround link?

