Created attachment 994359 [details]
/var/log/messages

Description of problem:
We tested the new puddle on an HA-neutron environment on bare-metal machines with Rally. We created a task that creates and deletes 20 instances at a time. When I rebooted a server in the middle of a task, the server failed to start the galera cluster during the initialization process.

150222 18:32:59 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.

Version-Release number of selected component (if applicable):
rhel-osp-installer-client-0.5.5-5.el7ost.noarch
rhel-osp-installer-0.5.5-5.el7ost.noarch
mariadb-galera-server-5.5.40-6.el7ost.x86_64
mariadb-galera-common-5.5.40-6.el7ost.x86_64
galera-25.3.5-6.el7ost.x86_64
mariadb-5.5.41-2.el7_0.x86_64
mariadb-libs-5.5.41-2.el7_0.x86_64

How reproducible:
3/3

Steps to Reproduce:
1. Create an HA-neutron setup with 3 controllers and 1 compute.
2. Start a rally task that boots & deletes 20 VMs.
3. Reboot a server.
Created attachment 994360 [details] mariadb log
Created attachment 994364 [details] Errors from Asaf setup
Created attachment 994365 [details] Errors from Leonid setup
A quick search suggests that this is the solution:

mysqld --tc-heuristic-recover=ROLLBACK
(In reply to Ryan O'Hara from comment #8)
> A quick search suggests that this is the solution:
>
> mysqld --tc-heuristic-recover=ROLLBACK

mysqld doesn't exist; do you mean mysqld_safe?

Is this related to https://mariadb.atlassian.net/browse/MDEV-6660 ?
(In reply to Ofer Blaut from comment #9)
> (In reply to Ryan O'Hara from comment #8)
> > A quick search suggests that this is the solution:
> >
> > mysqld --tc-heuristic-recover=ROLLBACK
>
> mysqld doesn't exist; do you mean mysqld_safe?

mysqld does exist, it's just not in your default path. Anyway, just use mysqld_safe.

> Is this related to https://mariadb.atlassian.net/browse/MDEV-6660 ?

Maybe.
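For reference, a minimal sketch of the manual recovery step being discussed, run on the broken node only (this assumes the node has already been taken out of cluster management; the exact invocation may differ on a pacemaker-managed setup):

  # start mysqld once with heuristic recovery to roll back the pending transaction
  mysqld_safe --tc-heuristic-recover=ROLLBACK

  # once it starts cleanly, stop it again (credentials as appropriate) and hand
  # control back to the cluster manager
  mysqladmin shutdown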
As far as I can tell mariadb should not and will not start until you clear up the pending transaction.
Hi Ryan,

This is not HA. A server might fail in the middle of a transaction; the cluster must keep on working.
(In reply to Ofer Blaut from comment #12)
> Hi Ryan,
>
> This is not HA. A server might fail in the middle of a transaction; the
> cluster must keep on working.

You should have 3 controllers, thus 3 mariadb nodes. According to the bug description you suddenly rebooted one of the controllers. It turns out that controller was in the middle of a transaction, so when you brought the node back online it failed to join the galera cluster. Is this correct so far?

In that case the node can't join the cluster because mariadb will not start until you tell it what to do about the pending transaction. This requires manual intervention. In the meantime, your other two galera nodes should be working. The reason this BZ should be closed is that mariadb-galera, as far as I can tell, is working as expected.
Adding needinfo on Ofer to review comment #13 and close BZ if agreed.
The other two controllers were in Slave State and the Galera state was "not synced".
@mbayer: Can you please take this bug over from rohara?

As discussed on list there are multiple items at play here, but the only thing we should be concerned about in this specific bug is whether or not a reboot of a single node in the Galera cluster would cause the other two nodes to go offline.

We cannot fix (in this bug) the issue where a transaction is in flight and requires manual unblocking. Apparently, that is caused by a problem with systemd or an improper unit file not waiting for the init scripts to complete before halting the system. A separate bug should be filed by QE for that specific issue.

But for this issue, what we need to determine is whether we can replicate a situation where a failure in one Galera node causes the other two to become non-operational, thereby taking the entire database offline.
(In reply to Ofer Blaut from comment #15)
> The other two controllers were in Slave State

Where are you seeing "slave state"?
It was seen in pcs status
(In reply to Ofer Blaut from comment #18)
> It was seen in pcs status

OK. I don't think that applies here, but I defer to the pacemaker folks. Galera itself is multi-master.
OK...aren't we typically setting things up so that all the apps are talking to *one* actual master via HAProxy? That would be the state in which the two other nodes are working fine but the one chosen as the "master" is seen as "out of sync". But clustercheck on that node would be reporting "not synced" for that node, and I'd assume HAProxy would move to another node, hence the cluster stays up.

What is the origin of the message "not synced"? Is that from clustercheck, and is that the state reported by all nodes or just the one that was rebooted?
Asaf / Ofer - we need to confirm that in the "failed state" described in this issue, the Galera cluster itself has failed, e.g. that connecting directly to the two other MariaDB instances either fails or reaches a database that is unusable, and if so, what the state of those two databases exactly is. There's a mariadb log here but I assume that is from the node with the stuck transaction on it.

If the other two MariaDBs are usable, but the "cluster" is unavailable from a pacemaker point of view, then the issue is elsewhere, e.g. pacemaker or haproxy.

Can this "failed state" be reproduced in an environment I can log in to? That way I can just see what exactly has failed.
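Something along these lines, run against each of the two surviving controllers, would tell us whether the databases themselves are usable (a sketch only; the node addresses and credentials are placeholders for whatever the actual setup uses):

  # check each surviving controller directly, bypassing haproxy
  for node in <controller-2-ip> <controller-3-ip>; do
      echo "== $node =="
      mysql -h "$node" -u root -e "SHOW STATUS LIKE 'wsrep_cluster_status'; SHOW STATUS LIKE 'wsrep_cluster_size';"
  done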
Other questions:

1. The /var/log/messages log shows mariadb being started by systemd. Under pacemaker this should not occur; mariadb should be started using the OCF resource agents.

2. I can't find the context for the "Found 1 prepared transactions!" error in the attached logfiles. The errors within the logfiles seem to indicate a conflict of file access (two different systems trying to start mariadb at the same time?) as a key issue.

Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 galera(galera)[29571]: INFO: MySQL not running: removing old PID file
Feb 23 12:31:55 mac441ea1733d43 galera(galera)[29571]: INFO: MySQL stopped
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
...
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 systemd: mariadb.service: main process exited, code=exited, status=1/FAILURE
Feb 23 12:31:55 mac441ea1733d43 systemd: Unit mariadb.service entered failed state.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Unable to lock ./ibdata1, error: 11
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: Check that you do not already have another mysqld process
Feb 23 12:31:55 mac441ea1733d43 mysqld_safe: InnoDB: using the same InnoDB data or log files.
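A quick way to check the systemd-vs-pacemaker double-start suspicion directly on a controller (a sketch; the service and resource names are assumed from the logs and pcs output in this bug):

  # under pacemaker, mariadb should NOT be enabled as a standalone systemd service
  systemctl is-enabled mariadb
  systemctl status mariadb

  # see what actually launched the running mysqld (systemd unit vs. resource agent)
  ps -ef | grep -E 'mysqld_safe|mysqld' | grep -v grep

  # pacemaker's own view of the galera resource
  pcs status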
re: --tc-heuristic-recover

1. MySQL's official documentation says the option is unused: http://dev.mysql.com/doc/refman/5.6/en/server-options.html#option_mysqld_tc-heuristic-recover. However, considering that Percona devs have been discussing this flag, that would appear to be inaccurate?

2. The message "prepared transaction" implies that this is a two-phase transaction that needs to be rolled back. This Percona article supports that notion: http://www.percona.com/blog/2014/07/15/tokudb-tips-mysql-backups/. Question: what part of OpenStack is using two-phase transactions? That would seem to be entirely inappropriate. Need to look into rally etc. and see what's going on.

3. The same Percona article states that the flag is entirely broken due to this issue, from Oct 14: http://bugs.mysql.com/bug.php?id=70860, Percona's issue: https://bugs.launchpad.net/percona-server/+bug/1334330
codership / Galera: https://mariadb.com/kb/en/mariadb/mariadb-galera-cluster-known-limitations/

"XA transactions are not supported."

https://mariadb.com/kb/en/mariadb/xa-transactions/

"In MariaDB, XA transactions can only be used with storage engines that support them. At least InnoDB, TokuDB and SPIDER support them. For InnoDB, XA transactions can be disabled by setting the innodb_support_xa server system variable to 0. XA is currently not available with MariaDB Galera Cluster."

TokuDB above is what's mentioned in the Percona article where they're dealing with this flag.
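If it helps the investigation, a hedged way to check whether XA is in play on a given node (these are standard MariaDB 5.5 statements; whether they mean anything under Galera is exactly what's in question here):

  # is InnoDB's internal XA / two-phase-commit support enabled on this node?
  mysql -e "SHOW VARIABLES LIKE 'innodb_support_xa';"

  # list any transactions left in the XA PREPARED state (only possible once mysqld is up)
  mysql -e "XA RECOVER;"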
(In reply to Michael Bayer from comment #20)
> OK...aren't we typically setting things up so that all the apps are talking
> to *one* actual master via HAProxy?

Correct

> That would be the state in which the two other nodes are working fine but
> the one chosen as the "master" is seen as "out of sync". But clustercheck on
> that node would be reporting "not synced" for that node, and I'd assume
> HAProxy would move to another node, hence the cluster stays up.

HAProxy would take that node out of rotation, yes. However it seems that this node going down is causing pacemaker's monitoring of galera to detect some kind of failure on the remaining nodes, which is why they appear to be in a slave state. I see:

Feb 23 12:06:56 mac441ea1733d43 lrmd[2935]: notice: operation_finished: galera_monitor_20000:21982:stderr [ ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) ]
Feb 23 12:06:56 mac441ea1733d43 lrmd[2935]: notice: operation_finished: galera_monitor_20000:21982:stderr [ ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) ]
Feb 23 12:06:56 mac441ea1733d43 lrmd[2935]: notice: operation_finished: galera_monitor_20000:21982:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'monitor_user' has permissions to view status ]
Feb 23 12:06:56 mac441ea1733d43 lrmd[2935]: notice: operation_finished: galera_monitor_20000:21982:stderr [ ocf-exit-reason:local node <pcmk-mac441ea1733d43> is neither in primary mode nor in read_only mode. Unknown state. ]

> What is the origin of the message "not synced"? Is that from clustercheck,
> and is that the state reported by all nodes or just the one that was
> rebooted?
I reproduced it on the A1 environment (bare-metal HA-neutron), running a rally boot & delete task of 20 instances as a constant load, and during the task I rebooted one of the controllers with the "reboot" command. The server started but mysql was down, and the output of "clustercheck" was "cluster not synced". After some time I checked the other controllers and saw that mysql was down on the other controllers as well.

I am keeping the setup if someone wants to debug it. I'm also adding /var/log and the mysql log from the rebooted controller and from another controller in the cluster.

rpm:
foreman-installer-1.6.0-0.3.RC1.el7ost.noarch
openstack-foreman-installer-3.0.16-1.el7ost.noarch
rhel-osp-installer-0.5.5-5.el7ost.noarch
rhel-osp-installer-client-0.5.5-5.el7ost.noarch
mariadb-galera-server-5.5.40-6.el7ost.x86_64
mariadb-galera-common-5.5.40-6.el7ost.x86_64
galera-25.3.5-6.el7ost.x86_64
MySQL-python-1.2.3-11.el7.x86_64
mariadb-5.5.41-2.el7_0.x86_64
mariadb-libs-5.5.41-2.el7_0.x86_64
Created attachment 1003252 [details] server log of a1 reproduce
ip-galera-pub-10.35.173.156  (ocf::heartbeat:IPaddr2):  Started pcmk-mac848f69fbc643
Master/Slave Set: galera-master [galera]
     Slaves: [ pcmk-mac848f69fbc643 ]
     Stopped: [ pcmk-mac848f69fbc49f pcmk-mac848f69fbc4c3 ]
(In reply to Asaf Hirshberg from comment #26)
> I reproduced it on the A1 environment (bare-metal HA-neutron), running a
> rally boot & delete task of 20 instances as a constant load, and during the
> task I rebooted one of the controllers with the "reboot" command. The server
> started but mysql was down, and the output of "clustercheck" was "cluster
> not synced". After some time I checked the other controllers and saw that
> mysql was down on the other controllers as well.

Don't rely on clustercheck. It is not a debugging tool. Its only purpose is to give an HTTP response to haproxy for healthcheck purposes. Specifically, it just tells haproxy if the node is synced and therefore available for reads/writes. This article I wrote is outdated, but it still has some valid bits:

https://www.rdoproject.org/MariaDB_Galera

If you run 'mysql -e "SHOW STATUS LIKE 'wsrep%';"' you will get a full picture of what the state is. All clustercheck does is look to see if wsrep_local_state is 4.
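To make that concrete, roughly what I'd look at on each node (a minimal sketch; the "healthy" values in the comments are the usual expectations for a 3-node cluster, not something verified on this setup):

  mysql -e "SHOW STATUS LIKE 'wsrep_local_state';"          # 4 (Synced) is what clustercheck requires
  mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"  # human-readable state, e.g. Synced / Donor / Joined
  mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"       # should be Primary
  mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"         # should be 3 with all controllers joined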
(In reply to Ryan O'Hara from comment #29)
>
> Don't rely on clustercheck. It is not a debugging tool. Its only purpose is
> to give an HTTP response to haproxy for healthcheck purposes.

Right, but clustercheck would be one reason why HAProxy wouldn't go to those nodes even if they were running fine (though it sounds like they are not). I'm going to ssh into the machine now and see what's up; specifically I want to see if systemd and the resource agents are competing to start the service.
Created attachment 1003350 [details] mysqld.log from 192.168.0.2
Created attachment 1003351 [details] mysqld.log from 192.168.0.3
Created attachment 1003352 [details] mysqld.log from 192.168.0.4
So I've looked in the logs, so far: 1. with the latest reproduction case, systemd is not involved at all. The services are not part of systemd and there's nothing about systemd in the logs related to mariadb/mysql. All the galera nodes seem to be started with the resource manager. 2. it looks like to reproduce this, there's a lot of stops/starts required as I see lots of restarts in all three nodes. 3. the 192.168.0.2 node does in fact report fully on an XA transaction that needs to be recovered. This should not be the case, as XA transactions aren't supported with Galera. I have three very general theories: 1. something is actually trying to do an XA transaction on the cluster, and if that occurs while the restart does, things break. 2. Galera somehow uses the XA system internally in some way, and that's what we see. 3. This is not really an XA transaction issue; the InnoDB state is corrupted in some strange way such that it *thinks* this is what it sees. 4. The 0.3/0.4 nodes do restart, but they fail to create a cluster once the .2 node is down, and the error reported is "SSL error: SSL_CTX_set_default_verify_paths failed". This error appears many times in all three logs on some of the startups, but once the .2 node is permanently down, it seems to persist. Segments of logs are below: .2 had its outage after 14:54, and was restarting at 15:04: 150318 14:54:09 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2636680) 150318 14:54:09 [Note] WSREP: Synchronized with group, ready for connections 150318 14:54:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 150318 14:54:12 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) synced with group. 150318 15:04:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 150318 15:04:05 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.aR5WGz' --pid-file='/var/lib/mysql/mac848f69fbc49f.example.com-recover.pid' in .3 we can see the node dropped correctly: 150318 14:59:40 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid fc7641ec-cd65-11e4-b054-37790c49e 075 is blacklisted, skipping 150318 14:59:40 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://192.168.0.2:4567 150318 14:59:40 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid fc7641ec-cd65-11e4-b054-37790c49e 075 is blacklisted, skipping 150318 14:59:41 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') reconnecting to 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567), atte mpt 0 150318 14:59:42 [Note] WSREP: evs::proto(fc7641ec-cd65-11e4-b054-37790c49e075, OPERATIONAL, view_id(REG,2db36c8e-cd63-11e4-aec4-369b15ae4055,12)) suspecting node: 2db36c8e-cd 63-11e4-aec4-369b15ae4055 150318 14:59:42 [Note] WSREP: evs::proto(fc7641ec-cd65-11e4-b054-37790c49e075, GATHER, view_id(REG,2db36c8e-cd63-11e4-aec4-369b15ae4055,12)) suspecting node: 2db36c8e-cd63-11 e4-aec4-369b15ae4055 150318 14:59:43 [Note] WSREP: evs::proto(fc7641ec-cd65-11e4-b054-37790c49e075, GATHER, view_id(REG,2db36c8e-cd63-11e4-aec4-369b15ae4055,12)) suspecting node: 2db36c8e-cd63-11 e4-aec4-369b15ae4055 150318 14:59:43 [Note] WSREP: declaring 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 stable 150318 14:59:43 [Note] WSREP: Node 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 state prim 150318 14:59:43 [Note] WSREP: 
view(view_id(PRIM,9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,13) memb { 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,0 fc7641ec-cd65-11e4-b054-37790c49e075,0 } joined { } left { } partitioned { 2db36c8e-cd63-11e4-aec4-369b15ae4055,0 }) 150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567) 150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567 150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2 150318 14:59:43 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid fc7641ec-cd65-11e4-b054-37790c49e 075 is blacklisted, skipping 150318 14:59:43 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID. 150318 14:59:43 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') turning message relay requesting off 150318 14:59:43 [Note] WSREP: STATE EXCHANGE: sent state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 150318 14:59:43 [Note] WSREP: STATE EXCHANGE: got state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 from 0 (mac848f69fbc643.example.com) 150318 14:59:43 [Note] WSREP: STATE EXCHANGE: got state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 from 1 (mac848f69fbc4c3.example.com) 150318 14:59:43 [Note] WSREP: Quorum results: version = 3, component = PRIMARY, conf_id = 12, members = 2/2 (joined/total), act_id = 2637851, last_appl. = 2637726, protocols = 0/5/3 (gcs/repl/appl), group UUID = 68d12982-c254-11e4-a0e8-a6fbd098225f similar on .4: 150318 14:59:43 [Note] WSREP: view(view_id(PRIM,9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,13) memb { 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,0 fc7641ec-cd65-11e4-b054-37790c49e075,0 } joined { } left { } partitioned { 2db36c8e-cd63-11e4-aec4-369b15ae4055,0 }) 150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567) 150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567 150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2 150318 14:59:43 [Note] WSREP: (9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.4:4567' pointing to uuid 9d004ec7-cd6d-11e4-a36d-d2cd1c331 ef3 is blacklisted, skipping 150318 14:59:43 [Note] WSREP: (9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3, 'ssl://0.0.0.0:4567') turning message relay requesting off 150318 14:59:43 [Note] WSREP: STATE_EXCHANGE: sent state UUID: a55168ce-cd6e-11e4-aaed-a7966a228361 150318 14:59:43 [Note] WSREP: STATE EXCHANGE: sent state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 150318 14:59:43 [Note] WSREP: STATE EXCHANGE: got state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 from 0 (mac848f69fbc643.example.com) 150318 14:59:43 [Note] WSREP: STATE EXCHANGE: got state msg: a55168ce-cd6e-11e4-aaed-a7966a228361 from 1 (mac848f69fbc4c3.example.com) 150318 14:59:43 [Note] WSREP: Quorum results: version = 3, component = PRIMARY, conf_id = 12, members = 2/2 (joined/total), act_id = 2637851, last_appl. 
= 2637726, protocols = 0/5/3 (gcs/repl/appl), group UUID = 68d12982-c254-11e4-a0e8-a6fbd098225f subsequent startups on .3 and .4 fail to set up SSL correctly, we get "SSL error: SSL_CTX_set_default_verify_paths failed" and the database starts but the cluster does not: 150318 15:42:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 150318 15:42:11 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.cS5mmd' --pid-file='/var/lib/mysql/mac848f69fbc643.example.com-recover.pid' 150318 15:42:11 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295 150318 15:42:11 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295 150318 15:42:11 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835) 150318 15:42:13 mysqld_safe WSREP: Recovered position 68d12982-c254-11e4-a0e8-a6fbd098225f:2642150 150318 15:42:13 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295 150318 15:42:13 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295 150318 15:42:13 [Note] WSREP: wsrep_start_position var submitted: '68d12982-c254-11e4-a0e8-a6fbd098225f:2642150' 150318 15:42:13 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835) 150318 15:42:13 [Note] WSREP: Read nil XID from storage engines, skipping position init 150318 15:42:13 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so' 150318 15:42:13 [Note] WSREP: wsrep_load(): Galera 3.5(rXXXX) by Codership Oy <info> loaded successfully. 150318 15:42:13 [Note] WSREP: CRC-32C: using hardware acceleration. 150318 15:42:13 [Note] WSREP: Found saved state: 68d12982-c254-11e4-a0e8-a6fbd098225f:2642150 150318 15:42:13 [Note] WSREP: Passing config to GCS: base_host = 192.168.0.4; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.weight = 1; protonet 150318 15:42:13 [Note] WSREP: Service thread queue flushed. 150318 15:42:13 [Note] WSREP: Assign initial position for certification: 2642150, protocol version: -1 150318 15:42:13 InnoDB: The InnoDB memory heap is disabled 150318 15:42:13 InnoDB: Mutexes and rw_locks use GCC atomic builtins 150318 15:42:13 InnoDB: Compressed tables use zlib 1.2.7 150318 15:42:13 InnoDB: Using Linux native AIO 150318 15:42:13 InnoDB: Initializing buffer pool, size = 128.0M 150318 15:42:13 InnoDB: Completed initialization of buffer pool 150318 15:42:13 InnoDB: highest supported file format is Barracuda. 
150318 15:42:13 InnoDB: Waiting for the background threads to start 150318 15:42:14 Percona XtraDB (http://www.percona.com) 5.5.40-MariaDB-36.1 started; log sequence number 2019224366 150318 15:42:14 [Note] Plugin 'FEEDBACK' is disabled. 150318 15:42:14 [Warning] Failed to setup SSL 150318 15:42:14 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed 150318 15:42:14 [Note] Server socket created on IP: '192.168.0.4'. 150318 15:42:14 [Note] Event Scheduler: Loaded 0 events 150318 15:42:14 [Note] /usr/libexec/mysqld: ready for connections. Version: '5.5.40-MariaDB-wsrep' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server, wsrep_25.11.r4026 150318 15:59:44 [Note] /usr/libexec/mysqld: Normal shutdown on .2, we have it coming up with the broken XA state: 150318 15:04:05 InnoDB: Database was not shut down normally! InnoDB: Starting crash recovery. InnoDB: Reading tablespace information from the .ibd files... InnoDB: Restoring possible half-written data pages from the doublewrite InnoDB: buffer... InnoDB: Doing recovery: scanned up to log sequence number 2013915322 InnoDB: Transaction 50BBE2 was in the XA prepared state. InnoDB: 1 transaction(s) which must be rolled back or cleaned up InnoDB: in total 0 row operations to undo InnoDB: Trx id counter is 50BD00 150318 15:04:06 InnoDB: Starting an apply batch of log records to the database... InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 InnoDB: Apply batch completed InnoDB: Starting in background the rollback of uncommitted transactions 150318 15:04:07 InnoDB: Rollback of non-prepared transactions completed 150318 15:04:07 InnoDB: Waiting for the background threads to start 150318 15:04:08 Percona XtraDB (http://www.percona.com) 5.5.40-MariaDB-36.1 started; log sequence number 2013915322 150318 15:04:08 [Note] Plugin 'FEEDBACK' is disabled. 150318 15:04:08 InnoDB: Starting recovery for XA transactions... 150318 15:04:08 InnoDB: Transaction 50BBE2 in prepared state after recovery 150318 15:04:08 InnoDB: Transaction contains changes to 1 rows 150318 15:04:08 InnoDB: 1 transactions in prepared state after recovery 150318 15:04:08 [Note] Found 1 prepared transaction(s) in InnoDB 150318 15:04:08 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions. 150318 15:04:08 [ERROR] Aborting 150318 15:04:08 InnoDB: Starting shutdown... 150318 15:04:10 InnoDB: Shutdown completed; log sequence number 2013919992 150318 15:04:10 [Note] /usr/libexec/mysqld: Shutdown complete' (END)
Okey doke, as for why we actually see things related to XA: it seems very likely my #2 theory is right, that InnoDB uses XA internally. This blog post seems to suggest as much: http://harrison-fisk.blogspot.com/2009/01/xa-and-persistent-innodb-locks.html

So we may have to go forward assuming that in very rare cases this XA issue can occur on very bad crashes, and we may have to produce a workaround. However, I don't know that the rest of the cluster should be affected by this. Asking Ryan about the SSL issue.
Here's the ps line for the two mariadb nodes that are running. Note there are no gcomm:// addresses here:

root      64793 101928  0 20:49 pts/0  00:00:00 grep --color=auto mysql
root     110421      1  0 16:29 ?      00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --read-only=true
mysql    111553 110421  0 16:29 ?      00:00:44 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --read-only=true --log-error=/var/log/mysqld.log --open-files-limit=-1 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=68d12982-c254-11e4-a0e8-a6fbd098225f:2642150

So... is pacemaker / the resource agent getting confused on startup by node #1 not starting, or by something else, and not working correctly? Or perhaps this is unrelated to node #1 not starting at all?
okey doke - Just got done with Ryan. What we're seeing here at the moment does not look like anything very unexpected or unreasonable at the Galera level, at least. The failed state of the first node did *not* bring the galera cluster down at all - as far as what the logs show, the cluster stayed running just fine: node 1 goes down, sometime after 14:54, and we see it coming up again at 15:04 and then failing to start: 150318 14:54:12 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) synced with group. 150318 15:04:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 150318 15:04:08 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions. 150318 15:04:08 [ERROR] Aborting nodes 2 and 3 detect the outage at 14:54, rebuild a new cluster between the two galera nodes, keep running fine: 150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567) 150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567 150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2 150318 14:59:43 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid fc7641ec-cd65-11e4-b054-37790c49e075 is blacklisted, skipping 150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567) 150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567 150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2 150318 14:59:43 [Note] WSREP: (9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3, 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.4:4567' pointing to uuid 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 is blacklisted, skipping So while node 1 was down from 14:54 on and never came up again, node 2 and 3 recovered without issue. Cluster stayed up from a Galera perspective. Node 1 was entirely unreachable so as far as HAProxy knowing how to adjust, it should not have been anything special, that is, the exotic failed state of node 1 shouldn't have affected anything - it was just an unreachable server. However, what happens next is 40 minutes later at around 15:41, both nodes 2 *and* node 3 are bounced at approximately the same time; node 2 is restarted, node 3 doesn't seem to come back until 1629. When both of these come back, there's no more cluster. At that point, things are also not too surprising. First off, when all nodes in the galera cluster are off, the cluster is gone and it would need to do bootstrap again to make a new galera cluster. In this case, per Ryan, Pacemaker/resource agent isn't able to do a whole new bootstrap process from scratch when one of the involved nodes can't come up at all, and we think this might be normal (adding dvossel to confirm). The only ways to get this up again are 1. restore node 1 before attempting to restart the entire cluster or 2. remove node 1 from the pacemaker cluster entirely before trying to restart. A customer would likely do #1. I don't know that people expect Pacemaker to build a complete cluster from scratch when one of the nodes is not able to start. 
As far as the frozen state of node #1 goes, it needs to be resolved manually; resetting grastate.dat *may* work, if it causes galera to do a full SST before it gets to the InnoDB part of it, as the corrupted InnoDB files would be replaced by rsync.
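A sketch of what that manual recovery might look like, on the frozen node only (paths assume the default /var/lib/mysql datadir and the resource name from the pcs output earlier in this bug; and per the caveat above, this only helps if a full SST happens before InnoDB recovery is attempted):

  # with the galera resource stopped/unmanaged on this node only
  mv /var/lib/mysql/grastate.dat /var/lib/mysql/grastate.dat.bak

  # optionally move the damaged InnoDB files aside so the SST from a healthy donor rebuilds them
  # (if mysqld still aborts during InnoDB crash recovery, this step alone is not enough)

  # then clear the failure and let pacemaker try to start and SST the node again
  pcs resource cleanup galera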
(In reply to Michael Bayer from comment #37) > okey doke - > > Just got done with Ryan. What we're seeing here at the moment does not > look like anything very unexpected or unreasonable at the Galera level, at > least. The failed state of the first node did *not* bring the galera > cluster down at all - as far as what the logs show, the cluster stayed > running just fine: > > node 1 goes down, sometime after 14:54, and we see it coming up again at > 15:04 and then failing to start: > > 150318 14:54:12 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) > synced with group. > 150318 15:04:05 mysqld_safe Starting mysqld daemon with databases from > /var/lib/mysql > > 150318 15:04:08 [ERROR] Found 1 prepared transactions! It means that mysqld > was not shut down properly last time and critical recovery information (last > binlog or tc.log file) was manually deleted after a crash. You have to start > mysqld with --tc-heuristic-recover switch to commit or rollback pending > transactions. > 150318 15:04:08 [ERROR] Aborting > > > nodes 2 and 3 detect the outage at 14:54, rebuild a new cluster between the > two galera nodes, keep running fine: > > 150318 14:59:43 [Note] WSREP: forgetting > 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567) > 150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567 > 150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, > my_idx = 1, memb_num = 2 > 150318 14:59:43 [Note] WSREP: (fc7641ec-cd65-11e4-b054-37790c49e075, > 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.3:4567' pointing to uuid > fc7641ec-cd65-11e4-b054-37790c49e075 is blacklisted, skipping > > > 150318 14:59:43 [Note] WSREP: forgetting > 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567) > 150318 14:59:43 [Note] WSREP: deleting entry ssl://192.168.0.2:4567 > 150318 14:59:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, > my_idx = 0, memb_num = 2 > 150318 14:59:43 [Note] WSREP: (9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3, > 'ssl://0.0.0.0:4567') address 'ssl://192.168.0.4:4567' pointing to uuid > 9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3 is blacklisted, skipping > > So while node 1 was down from 14:54 on and never came up again, node 2 and 3 > recovered without issue. Cluster stayed up from a Galera perspective. > Node 1 was entirely unreachable so as far as HAProxy knowing how to adjust, > it should not have been anything special, that is, the exotic failed state > of node 1 shouldn't have affected anything - it was just an unreachable > server. > > However, what happens next is 40 minutes later at around 15:41, both nodes 2 > *and* node 3 are bounced at approximately the same time; node 2 is > restarted, node 3 doesn't seem to come back until 1629. When both of these > come back, there's no more cluster. > > At that point, things are also not too surprising. First off, when all > nodes in the galera cluster are off, the cluster is gone and it would need > to do bootstrap again to make a new galera cluster. In this case, per Ryan, > Pacemaker/resource agent isn't able to do a whole new bootstrap process from > scratch when one of the involved nodes can't come up at all, and we think > this might be normal (adding dvossel to confirm). The only ways to get > this up again are 1. restore node 1 before attempting to restart the entire > cluster or 2. remove node 1 from the pacemaker cluster entirely before > trying to restart. A customer would likely do #1. 
> I don't know that people expect Pacemaker to build a complete cluster from
> scratch when one of the nodes is not able to start.

In the event of a complete failure, the galera resource-agent can only automate the bootstrap process if all the galera instances have reported in their last known sequence number. Right now, in order to report the last known sequence number, each node must be able to start galera in read-only mode. If a node is incapable of starting galera in read-only mode, the cluster is expected to block waiting for manual intervention.

> As far as the frozen state of node #1 goes, it needs to be resolved
> manually; resetting grastate.dat *may* work, if it causes galera to do a
> full SST before it gets to the InnoDB part of it, as the corrupted InnoDB
> files would be replaced by rsync.
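For context, the "report the last known sequence number" step is essentially a wsrep position recovery; a rough hand-run equivalent on a single node looks like the following (shown only as an illustration, with the default datadir and the mysqld path seen in the logs above; the resource agent normally does this itself):

  # the position recorded on disk; seqno: -1 means it has to be recovered from InnoDB
  cat /var/lib/mysql/grastate.dat

  # ask mysqld to recover the wsrep position without joining the cluster, then exit;
  # look for a "WSREP: Recovered position ..." line in the error log
  sudo -u mysql /usr/libexec/mysqld --wsrep-recover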
Hi Asaf - Everything we see so far indicates the first node went down at 14:59, failed to come back, and the cluster proceeded normally until 15:41. We need to get information on what exactly happened at 15:41, was the system manually rebooted or restarted in some way? The "last -x" logs on 192.168.0.3 show this: root pts/0 192.168.0.1 Wed Mar 18 15:45 - 15:47 (00:02) runlevel (to lvl 3) 3.10.0-229.el7.x Wed Mar 18 15:44 - 00:09 (08:25) reboot system boot 3.10.0-229.el7.x Wed Mar 18 15:43 - 00:09 (08:25) shutdown system down 3.10.0-229.el7.x Wed Mar 18 15:41 - 15:43 (00:01) root pts/0 192.168.0.1 Wed Mar 18 15:41 - down (00:00) root pts/1 192.168.0.1 Wed Mar 18 14:10 - down (01:30) root pts/1 192.168.0.1 Wed Mar 18 14:07 - 14:08 (00:01) and in the mysql log we see MySQL shutting off: 150318 15:41:43 [Note] /usr/libexec/mysqld: Normal shutdown we then do not see MySQL on .03 coming up again until 16:29: 150318 16:29:53 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 150318 16:29:53 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.ZZFsn6' --pid-file='/var/lib/mysql/mac848f69fbc4c3.example.com-recover.pid' 150318 16:29:53 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295 150318 16:29:53 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295 150318 16:29:53 [Warning] Could not increase number of max_open_files to more than 1024 (request: 1835) on .04, we don't see any reboot occurring but we see MySQL being stopped at 15:42: 150318 15:42:06 [Note] /usr/libexec/mysqld: Normal shutdown 150318 15:42:06 [Note] WSREP: Stop replication 150318 15:42:06 [Note] WSREP: Closing send monitor... and it comes back immediately: 150318 15:42:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 150318 15:42:11 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.cS5mmd' --pid-file='/var/lib/mysql/mac848f69fbc643.example.com-recover.pid' and another stop and start between 15:59 and 16:14: 150318 15:59:45 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended 150318 16:14:46 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql each of these restarts/reboots occurs 40 minutes after anything regarding the first node has happened. Can we please get complete detail over all steps taken in order to initiate these sequences? We'd like to know these restarts occurred without intervention or if this is something you were doing on your end.
I've confirmed that the "--tc-heuristic-recover=rollback" option is sufficient to recover the single node from this case. I copied the /var/lib/mysql files over to a local VM here and reproduced the same error on startup: [root@rhel7-1 lib]# mysqld_safe 150318 18:27:56 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'. 150318 18:27:56 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 150318 18:27:56 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.qGINfH' --pid-file='/var/lib/mysql/rhel7-1-recover.pid' 150318 18:27:59 mysqld_safe WSREP: Failed to recover position: ... 150318 18:27:58 InnoDB: Transaction contains changes to 1 rows 150318 18:27:58 InnoDB: 1 transactions in prepared state after recovery 150318 18:27:58 [Note] Found 1 prepared transaction(s) in InnoDB 150318 18:27:58 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions. 150318 18:27:58 [ERROR] Aborting then used the option and it was able to recover past this: [root@rhel7-1 lib]# mysqld_safe --tc-heuristic-recover=rollback 150318 18:28:41 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'. 150318 18:28:41 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 150318 18:28:41 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.gUbSQo' --pid-file='/var/lib/mysql/rhel7-1-recover.pid' 150318 18:28:43 mysqld_safe WSREP: Recovered position 68d12982-c254-11e4-a0e8-a6fbd098225f:2637841 the server acted more or less normally after that.
All the restarts were done manually by me. Each reboot was during a rally task run from another server on that setup. I got into this state after I used a constant load of creating instances, and not by using rps (runs per second) as I had from the beginning of the test.
(In reply to Asaf Hirshberg from comment #41)
> All the restarts were done manually by me. Each reboot was during a rally
> task run from another server on that setup. I got into this state after I
> used a constant load of creating instances, and not by using rps (runs per
> second) as I had from the beginning of the test.

Asaf,

We need a complete list of all reboots initiated, with the time and purpose of each reboot. For example:

14:50: reboot of 192.168.0.2 via ssh session as root from Rally
15:41: reboot of 192.168.0.3 via ssh session as root from Asaf

Something along these lines. Please get us this info, since right now we're struggling to understand why these nodes are rebooting.

The level of detail you're providing (in general) in this bug report isn't enough to really help us puzzle out the sequence of events here. Comments like "all the restarts were done manually by me" are too vague, since (as noted above) we need to know the exact details of each restart (when it was done and why). Please keep this in mind when assisting us in determining root cause for complex bugs like this.

If the 0.3 and 0.4 controller nodes indeed _did_ reboot at approx 15:41, while the 0.2 node was still in an unrecovered state (stuck pending a transaction), then effectively at 15:41 you've taken down all 3 nodes of a 3-node Galera cluster, rendering the cluster inoperable. As dvossel clarified in Comment #38, if you take down all 3 nodes cleanly, the Pacemaker Galera resource agent can bootstrap the cluster, but if you take down all 3 nodes and one of the nodes cannot start mariadb, then it is not possible to bootstrap the cluster from scratch.

Most automated HA systems can handle one or two concurrent errors and still maintain availability. But in the presence of multiple errors (rebooting all 3 nodes + a transaction stuck on one node) we just do not have a way to handle that. It's an expected failure scenario that would result in needing manual intervention to recover.

What we still need to figure out then is:

* When all the reboots were and why
* What caused the XA transaction to be stuck on the 0.2 node during the first reboot at approx 14:50
Ok, I was chatting with oblaut and he cleared up some confusion for me. Rally itself does not initiate any reboots as part of its testing. When Asaf says above "Each reboot was during a rally task run from another server on that setup", what he means is that while Rally was running, he manually SSH'd into the respective nodes and issued a reboot command.

So, I think we now have one item left to focus on. Namely: why did the initial reboot of the 0.2 node cause a transaction to get stuck in mariadb, which prevented the Galera cluster from healing back to a full 3-node cluster? This is the question that mbayer/dvossel/abeekhof will focus on answering over the next few days, with help from Asaf.
Hey Perry - The issue with the 0.2 node, I believe, is something that can just happen on rare occasions, as described in this blog post: http://harrison-fisk.blogspot.com/2009/01/xa-and-persistent-innodb-locks.html. I do have the mysql datafiles that cause this corruption captured, so we can reproduce the condition that way (actually, let me attach them).
ok perry understood, I started from scratch, installed a new setup and start testing,here is what i did on the setup:(fyi,controllers ip have changed) .each reboot done by me using reboot command ssh root.0.3 ./neutron-install tenant #script that create some tenants&router&networksun -------------- ssh root@rally rall-v task start constant.json { "NovaServers.boot_and_delete_server": [ { "args": { "sleep": 5, "flavor": { "name": "nano" }, "image": { "name": "cirros" }, "force_delete": false }, "runner": { "type": "constant", "times": 1000, "concurrency": 40 }, "context": { "users": { "tenants": 1, "users_per_tenant": 1 }, "quotas": { "nova": { "instances": -1, "cores": -1, "ram": -1 }, "neutron": { "port": -1, "network": -1, "router": -1, "floatingip": -1, "security_group": -1 } } } } ] } -------------- ssh root.0.3 reboot #after server up [root@mac848f69fbc4c3 ~]#pcs status Master/Slave Set: galera-master [galera] Masters: [ pcmk-mac848f69fbc643 pcmk-macf04da2732fb1 ] Stopped: [ pcmk-mac848f69fbc4c3 ] [root@mac848f69fbc4c3 ~]# mysql ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) [root@mac848f69fbc4c3 ~]# date Thu Mar 19 14:08:13 IST 2015 [root@mac848f69fbc4c3 ~]# pcs status Cluster name: openstack Master/Slave Set: galera-master [galera] Masters: [ pcmk-mac848f69fbc4c3 pcmk-mac848f69fbc643 pcmk-macf04da2732fb1 ] --------------- ssh root@rally rall-v task start constant.json -------------- [root@mac848f69fbc4c3 ~]# date;reboot Thu Mar 19 14:22:13 IST 2015 [root@mac848f69fbc4c3 ~]# date;mysql Thu Mar 19 14:28:12 IST 2015 ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) [root@mac848f69fbc4c3 ~]# pcs status Cluster name: openstack Master/Slave Set: galera-master [galera] Masters: [ pcmk-mac848f69fbc643 pcmk-macf04da2732fb1 ] Stopped: [ pcmk-mac848f69fbc4c3 ] [root@mac848f69fbc4c3 ~]# clustercheck HTTP/1.1 503 Service Unavailable Content-Type: text/plain Connection: close Content-Length: 36 Galera cluster node is not synced. --------------- ssh root@rally rall-v task start constant.json -------------- [root@staypuft ~]# ssh root.0.4 root.0.4's password: Last login: Thu Mar 19 14:42:56 2015 from 192.168.0.1 [root@mac848f69fbc643 ~]# reboot #the server stayed down after the reboot and didn't start, had to start using the kvm [root@mac848f69fbc643 ~]# date;mysql Thu Mar 19 15:09:23 IST 2015 ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) [root@mac848f69fbc643 ~]# Master/Slave Set: galera-master [galera] Masters: [ pcmk-mac848f69fbc4c3 pcmk-macf04da2732fb1 ] Stopped: [ pcmk-mac848f69fbc643 ] [root@macf04da2732fb1 ~]# date Thu Mar 19 15:24:38 IST 2015 [root@macf04da2732fb1 ~]# [root@macf04da2732fb1 ~]# reboot Connection to 192.168.0.5 closed by remote host. #the server stayed down after the reboot and didn't start, had to start using the kvm --------------- ssh root@rally rall-v task start constant.json -------------- [root@mac848f69fbc643 ~]# reboot Connection to 192.168.0.4 closed by remote host. Connection to 192.168.0.4 closed. 
[root@staypuft ~]# ssh root.0.4 root.0.4's password: Last login: Thu Mar 19 15:40:05 2015 from 192.168.0.1 [root@mac848f69fbc643 ~]# date;mysql Thu Mar 19 15:45:23 IST 2015 ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) [root@mac848f69fbc643 ~]# --------------- ssh root@rally rall-v task start constant.json -------------- [root@staypuft ~]# ssh root.0.3 root.0.3's password: Last login: Thu Mar 19 14:28:04 2015 from 192.168.0.1 [root@mac848f69fbc4c3 ~]# reboot Connection to 192.168.0.3 closed by remote host. Connection to 192.168.0.3 closed. [root@staypuft ~]# ssh root.0.3 root.0.3's password: Last login: Thu Mar 19 16:05:03 2015 from 192.168.0.1 [root@mac848f69fbc4c3 ~]# mysql ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111) ip-galera-pub-10.35.173.156 (ocf::heartbeat:IPaddr2): Stopped Master/Slave Set: galera-master [galera] galera (ocf::heartbeat:galera): FAILED pcmk-macf04da2732fb1 Stopped: [ pcmk-mac848f69fbc4c3 pcmk-mac848f69fbc643 ] [root@mac848f69fbc4c3 ~]# date Thu Mar 19 16:17:15 IST 2015 [root@mac848f69fbc643 ~]# mysql ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) [root@mac848f69fbc643 ~]# exit logout Connection to 192.168.0.4 closed. [root@staypuft ~]# ssh root.0.5 root.0.5's password: Last login: Thu Mar 19 16:21:15 2015 from 192.168.0.1 [root@macf04da2732fb1 ~]# mysql ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) [root@macf04da2732fb1 ~]# exit logout Connection to 192.168.0.5 closed. [root@staypuft ~]# ssh root.0.4 root.0.4's password: Last login: Thu Mar 19 16:25:27 2015 from 192.168.0.1 [root@mac848f69fbc643 ~]# mysql ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) [root@mac848f69fbc643 ~]# exit logout Connection to 192.168.0.4 closed. [root@staypuft ~]# ssh root.0.3 root.0.3's password: Last login: Thu Mar 19 16:10:39 2015 from 192.168.0.1 [root@mac848f69fbc4c3 ~]# mysql Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 28 Server version: 5.5.40-MariaDB-wsrep Copyright (c) 2000, 2014, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [(none)]>
OK, the attachment is too large to post. That said, the blog post above refers to an issue that was fixed in mysql 5.5.0, years ago. Googling for this issue otherwise pretty much turns up Percona-related hits. So this could be something that these days is specific to galera, and perhaps could be a fixable bug.
The above steps in comment 45 are still hard to follow, as there are missing steps (there's an ssh to root.0.4, and then a few lines down we are suddenly on another host, "[root@macf04da2732fb1 ~]# reboot Connection to 192.168.0.5 closed by remote host.") and scant timestamps as to what is happening when.

As close as I can tell, the reboot at 14:28 was normal; the "pcs status" call shows that pcmk-mac848f69fbc4c3 didn't yet have time to join the cluster. Then the rally task is started (which is still... while pcmk-mac848f69fbc4c3 is joining? or was this 10 minutes later?). Then we see the reboot of mac848f69fbc643, and then we see "the server stayed down after the reboot and didn't start, had to start using the kvm". What's *that* about? The whole OS is failing? From that point on, the mariadb on that node is in the failed state that we still don't understand.

For the moment, I'm going to keep peering into the fixed "failed" files I have, to try to find anything local to Galera/MariaDB as far as that individual state goes.
for the last log, AFAICT the MySQL instance here was either shutdown by pacemaker or somehow crashed. on macf04da2732fb1, reboot is called. on mac848f69fbc643, pacemaker logs seem to indicate this being seen: Mar 19 15:24:42 [3365] mac848f69fbc643.example.com attrd: info: attrd_peer_update: Setting shutdown[pcmk -macf04da2732fb1]: 0 -> 1426771482 from pcmk-macf04da2732fb1 Mar 19 15:24:43 [3367] mac848f69fbc643.example.com crmd: info: action_synced_wait: Managed galera_meta-data_0 process 30441 exited with rc=1 then on the mac848f69fbc643 machine, galera picks up that the node is down normally: 150319 15:24:49 [Note] WSREP: forgetting b112d756-ce22-11e4-ae6e-86b089817ae9 (ssl://192.168.0.5:4567) 150319 15:24:49 [Note] WSREP: deleting entry ssl://192.168.0.5:4567 150319 15:24:49 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2 ... 150319 15:24:54 [Note] WSREP: cleaning up b112d756-ce22-11e4-ae6e-86b089817ae9 (ssl://192.168.0.5:4567) took out the .5 node normally. Then nothing for four minutes, until.. ??? *** need to know what happened here, the mac848f69fbc643 was not rebooted but mariadb seems to have stopped *** Now we're restarting, and we start up again, and we're into the bad XA state. 150319 15:28:17 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql 150319 15:28:17 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.5p19Qf' --pid-file='/var/lib/mysql/mac848f69fbc643.example.com-recover.pid' there's a lot of pacemaker logs on mac848f69fbc643 during the 15:24 period but i don't know if they refer to Galera being affected. Can dvossel check the logs on mac848f69fbc643 to determine this ?
(In reply to Michael Bayer from comment #49) > for the last log, AFAICT the MySQL instance here was either shutdown by > pacemaker or somehow crashed. I'm fairly certain mysqld crashed. explanation below. > on macf04da2732fb1, reboot is called. on mac848f69fbc643, pacemaker logs > seem to indicate this being seen: > > Mar 19 15:24:42 [3365] mac848f69fbc643.example.com attrd: info: > attrd_peer_update: Setting shutdown[pcmk > -macf04da2732fb1]: 0 -> 1426771482 from pcmk-macf04da2732fb1 > > Mar 19 15:24:43 [3367] mac848f69fbc643.example.com crmd: info: > action_synced_wait: Managed galera_meta-data_0 process 30441 exited with > rc=1 Don't worry about this meta-data action. I don't know why it is failing but it has nothing to do with the problem at hand. > > then on the mac848f69fbc643 machine, galera picks up that the node is down > normally: > > 150319 15:24:49 [Note] WSREP: forgetting > b112d756-ce22-11e4-ae6e-86b089817ae9 (ssl://192.168.0.5:4567) > 150319 15:24:49 [Note] WSREP: deleting entry ssl://192.168.0.5:4567 > 150319 15:24:49 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, > my_idx = 0, memb_num = 2 > ... > > 150319 15:24:54 [Note] WSREP: cleaning up > b112d756-ce22-11e4-ae6e-86b089817ae9 (ssl://192.168.0.5:4567) > > took out the .5 node normally. > > > Then nothing for four minutes, until.. ??? right. this gap is where mysqld goes silent. === Sequence of events. === 15:23:10 START ENDS - successfully started galera in read-only. 15:23:12 PROMOTE BEGINS 15:23:13 shutdown of read only succeeds 15:23:19 initiate start of galera in cluster mode 15:23:26 INFO: Waiting for database to sync with the cluster. ^^^ here the agent is waiting for galera to successfully start and sync up with the galera cluster. 15:24:54 mysql logs go silent. ^^^ that's got to be the mysqld crash 15:28:12 - PROMOTE ENDS - Timed Out waiting for mysql to sync with the cluster... never happened because mysqld disappeared. 15:28:21 - PROMOTE FAILS again ^^^ someone had to have manually done a promote using pcs and that debug mode or something. 15:42:58 - MONITOR - reports galera isn't running. 15:43:04 - START - fails to start because of corrupted recovery information. Galera is hosed after the crash at 15:24:54 on this node until manual intervention. -- David
well then that's the next mystery. Why is that same mysqld being started at 15:23? Again we see it go silent at 14:41, and starting up at 15:23: 150319 14:41:37 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) synced with group. 150319 14:41:37 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 13384) 150319 14:41:37 [Note] WSREP: Synchronized with group, ready for connections 150319 14:41:37 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 150319 14:41:37 [Note] WSREP: Member 0.0 (mac848f69fbc4c3.example.com) synced with group. * silent * 150319 15:23:02 mysqld_safe Starting mysqld daemon with databases from /var/lib /mysql 150319 15:23:02 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.XKQLh6' --pid-file='/var/lib/mysql/mac848f69fbc643.example.com-recover.pid'
hi Asaf - We're going to need very exact steps so that I can log into a brand new environment like this and run exactly the same series of commands. We need to be able to run each step individually and examine all logs in between each step. Can I either use this environment, or can you set up a new one, and send us a series of exact commands (1. 2. 3. 4. etc. ) to produce this sequence? thanks.
Created attachment 1006060 [details] Python script to create database load
A related issue posted for mariadb-galera: it just notes that a node that has crashed into this particular state (described as "the innodb tablespace is damaged") means only that the single node can't automatically rejoin the cluster: https://mariadb.atlassian.net/browse/MDEV-6660
So, after a week of trying to make this happen, at least as far as what we see in the logs I attached above, it is fairly clear what happened there from a galera/pacemaker perspective. Looking over those logs again I can play out just what happened, and I can also do the same thing locally; it is what we theorized in comment #38.

Given 192.168.0.2 in mysqld.2.log, .3 in mysqld.3.log, and .4 in mysqld.4.log:

1. Let's first assume that the QE environment, which has been demonstrated to have some kind of issue with the "reboot" command, on some occasions of calling "reboot" crashes the server rather than gracefully rebooting it. In .2, we see that between 14:54 and 15:04 a crash and a restart occur:

150318 14:54:12 [Note] WSREP: Member 1.0 (mac848f69fbc643.example.com) synced with group.
150318 15:04:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

2. Nodes .3 and .4 adjust to this crash just fine:

192.168.0.3:
150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)

192.168.0.4:
150318 14:59:43 [Note] WSREP: forgetting 2db36c8e-cd63-11e4-aec4-369b15ae4055 (ssl://192.168.0.2:4567)

3. 192.168.0.2 crashed badly enough that the InnoDB files are corrupted, hence we see the issue described at https://mariadb.atlassian.net/browse/MDEV-6660. By 15:04, the .2 node is done and never heard from again, since it cannot start:

150318 15:04:08 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
150318 15:04:08 [ERROR] Aborting

150318 15:04:08 InnoDB: Starting shutdown...
150318 15:04:10 InnoDB: Shutdown completed; log sequence number 2013919992
150318 15:04:10 [Note] /usr/libexec/mysqld: Shutdown complete

4. From here on, we're down to only *two* nodes. Yet the shutdowns of MySQL databases seem to continue. .3 is shut down at 15:41:

150318 15:41:43 [Note] /usr/libexec/mysqld: Normal shutdown

.4 adapts and realizes there's no more cluster; .3 is considered to be "partitioned":

150318 15:41:58 [Note] WSREP: view(view_id(NON_PRIM,9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,14)
memb {
        9d004ec7-cd6d-11e4-a36d-d2cd1c331ef3,0
}
joined {
}
left {
}
partitioned {
        fc7641ec-cd65-11e4-b054-37790c49e075,0
})
150318 15:41:58 [Note] WSREP: New cluster view: global state: 68d12982-c254-11e4-a0e8-a6fbd098225f:2642150, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3

5. Then at 15:42 we see node .4 is shut down; .3 is still shut down as well:

150318 15:42:06 [Note] /usr/libexec/mysqld: Normal shutdown

The cluster is now gone. There's no database running at all.

6. .4 comes back at 15:42, to an empty cluster:

150318 15:42:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

and starts in normal non-clustered mode - Pacemaker can't do anything here:

150318 15:42:14 [Note] Event Scheduler: Loaded 0 events
150318 15:42:14 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.40-MariaDB-wsrep'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_25.11.r4026

7.
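A quick way to double-check what step 4 describes, i.e. that the two survivors keep quorum after .2 drops and that the last node only loses it once its peer goes away, is the standard wsrep status variables (a sketch; run via the local mysql client, add credentials as needed):

# on either surviving node while .2 is down:
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"   # 'Primary' = still has quorum
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"     # 2 at this point
# once the second node goes away, the last node reports 'non-Primary',
# which is exactly what the view(view_id(NON_PRIM, ...)) entry above shows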
.3 is finally brought back at 16:29:

150318 16:29:53 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

and it also comes back in non-clustered mode - because, as expected, Pacemaker can't build a cluster while .2 is still down:

150318 16:29:57 [Note] Event Scheduler: Loaded 0 events
150318 16:29:57 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.40-MariaDB-wsrep'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_25.11.r4026

And that's what happened! What isn't clear is how these MySQL databases were shut off, or whether they crashed, or what. But there's nothing here to show that anything happened to them other than that the machines were being constantly rebooted in an environment that has known issues with clean reboots.

Googling around for the "tc-heuristic-recover" error turns up quite a few results, all of which involve MariaDB-Galera / Percona. This seems to be a thing that can happen. However, none of these issues or blog posts says anything about this situation being linked to other nodes in the Galera cluster having problems.

After staring at this for two weeks and getting a lot more comfortable watching galera and pacemaker restart over and over again, I'm not finding evidence of anything unusual within the scope of these tools.
Okay, in short:

1. galera crashes on node1
2. node2 and node3 gracefully reboot
3. the galera cluster can't start again on node2 and node3 because node1's data is corrupted from the crash. (When all galera instances are down, pacemaker has to see all galera instances come up in read-only mode before it lets the galera cluster re-form.)

In this condition, we expect pacemaker to block starting of the galera cluster until manual intervention takes place to figure out what is going on with the data corruption. If we determine there is a safe way to automate proceeding with recovery here, I can add that procedure to the agent.

We've known that galera theoretically had the potential to block automated recovery, but now that I'm seeing this blocking behavior reliably reproduced, it's kind of troubling.

-- David
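For the manual-intervention step above, one plausible sequence on the node with the corrupted data, going by the server's own error message and the binary path shown in the logs, would be roughly the following. This is a sketch, not a validated procedure: whether ROLLBACK or COMMIT is appropriate depends on the pending transaction, and the resource name "galera" is taken from the agent logs.

# on the node that refuses to start, with the pacemaker resource stopped:
# run the heuristic recovery once, outside the cluster (wsrep disabled so it
# doesn't try to join); mysqld should resolve the prepared transaction and
# then exit on its own
sudo -u mysql /usr/libexec/mysqld --wsrep-provider=none --tc-heuristic-recover=ROLLBACK
# then clear the failure so pacemaker retries the resource and the node can
# rejoin the cluster (an SST from a healthy node will resync it if needed)
pcs resource cleanup galera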
So we're not able to reproduce this in any way other than what would normally be expected when one of the Galera nodes crashes, and the logfiles we've observed all indicate that the cluster stays running fine when the first node is intentionally crashed out. The other nodes come back without issue, and only if the remaining two are stopped at the same time does Pacemaker refuse to join them back into a cluster. No logs indicate the other galera nodes having any kind of problem when the initial node goes down, this hasn't been reproduced in isolation despite much trying, and there are no reports in the wild of anything similar. There is some discussion about the "server won't reboot" issue on other channels, but that should be opened as a separate issue.
A workaround to recover from the mariadb failure is suggested in https://access.redhat.com/solutions/1283463
I tried to reproduce it again; I used a 3-controller, 8-compute setup (physical). I ran the Rally "boot-run-command" task and rebooted the same controller over and over (more than 10 times), and every time galera and the rest of the cluster managed to recover without any problem.
The workaround link https://access.redhat.com/solutions/1283463 isn't working. I am hitting this issue again and again. Can someone provide a working workaround link?