Bug 1235458
Summary: [mariadb-galera]: powering off a node corrupts the DB; the corruption prevents Galera from starting, so the node will not rejoin the cluster after it is powered back on.
Product: Red Hat OpenStack
Component: mariadb-galera
Version: 7.0 (Kilo)
Target Release: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: high
Keywords: Reopened
Reporter: Omri Hochman <ohochman>
Assignee: Michael Bayer <mbayer>
QA Contact: yeylon <yeylon>
CC: jschluet, lhh, mcornea, rohara, srevivo, yeylon
Type: Bug
Doc Type: Bug Fix
Last Closed: 2015-06-25 17:18:49 UTC
Description
Omri Hochman
2015-06-24 21:09:27 UTC
Created attachment 1042832 [details]: mariadb.log (full mariadb.log)
My initial take on this, as discussed on IRC, is that the power-off caused a single mariadb node to become corrupted in some way. The established approach to re-introducing a failed node into the cluster is to repair the corruption on the failed node first. In this case, the affected tables are the MyISAM tables "user" and "db", which are not handled by Galera. The operator should log into the console of this database specifically and run the "REPAIR TABLE" command, documented at https://dev.mysql.com/doc/refman/5.1/en/repair-table.html. In other words, this corruption is not a bug; it is a known behavior of MySQL/MariaDB with an established solution. At that point, the node should be able to rejoin the cluster, perhaps after a restart. Deleting grastate.dat will ensure the node does a full SST when it rejoins. From my POV this is a "worksforme".

To confirm that the mysql.user and mysql.db tables are in fact MyISAM, and that they are not replicated by Galera, see https://mariadb.com/kb/en/mariadb/mariadb-galera-cluster-known-limitations/:

"Currently replication works only with the InnoDB storage engine. Any writes to tables of other types, including system (mysql.*) tables are not replicated (this limitation excludes DDL statements such as CREATE USER, which implicitly modify the mysql.* tables — those are replicated). There is however experimental support for MyISAM - see the wsrep_replicate_myisam system variable."

This only matters because we can confirm that these two tables are corrupted in an ordinary way: just run "REPAIR TABLE" on them and restart.

Please reopen if the resolution doesn't work for you or there are other compounding factors. Thanks!

I tried to recover the broken node from this condition and encountered some factors that required workarounds. The end result was that the node rejoined the cluster, but can you confirm that the steps I followed are correct?
I couldn't log in to the console, since the server was only accepting connections from localhost via the file socket:

```
[root@overcloud-controller-0 ~]# mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
[root@overcloud-controller-0 ~]# ps axu | grep sql
mysql     1342  0.0  0.0  115348   1700 ?      Ss  08:35  0:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
mysql     3583  0.1  1.2 1490632 104136 ?      Sl  08:35  0:01 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --log-error=/var/log/mariadb/mariadb.log --open-files-limit=-1 --pid-file=/var/run/mariadb/mariadb.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=f66ffed3-1b32-11e5-a6b0-b25242c6d09d:4365
root     25412  0.0  0.0  112644    928 pts/0  S+  08:50  0:00 grep --color=auto sql
```

To work around this I added 'innodb_force_recovery = 1' to /etc/my.cnf.d/galera.cnf and restarted the mariadb service. After this I could access the console and run the repair steps:

```
SET wsrep_on=OFF;
REPAIR TABLE mysql.user;
REPAIR TABLE mysql.db;
```

Next I ran:

```
systemctl stop mariadb
rm /var/lib/mysql/grastate.dat
```

removed 'innodb_force_recovery = 1' from /etc/my.cnf.d/galera.cnf again, and then:

```
crm_resource -C -r galera-master
```

After running these steps the node rejoined the cluster. Can you please confirm that the steps I followed are correct? Thanks.

Absolutely. Once a *single* mariadb node is corrupted or otherwise unable to start, the general steps are:

1. Get that mariadb node to start as an independent database service first. Any special commands or startup recovery flags needed to make this happen are OK. It is usually even OK to rebuild the data directory of this node completely from scratch, because it will get all the current data and user accounts from the other nodes when it rejoins the cluster in any case. In other words, there's nothing you need to preserve there; a brand-new MariaDB service could join your cluster just as easily (*as long as the rest of the cluster is still running fine* — if you've lost all nodes, or corruption has hit all or most of them, that's a different ballgame).

2. Delete grastate.dat on the node that had the problem. This effectively means it will unconditionally have all of its data replaced by the other nodes when it rejoins the cluster. That is, all the InnoDB datafiles we might have been worried about in step 1 are going to be overwritten via an rsync in any case.

3. Restart the node again; it should join the cluster and be synchronized.

In an HA environment, there is no guarantee that a failed node is clean and able to rejoin the cluster (or even boot) — the HA software is there to provide continuity of service from the other hosts.

We should probably note these recovery steps somewhere.
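For reference, the recovery discussed in this report can be consolidated into one sketch. This is not an official procedure from this bug's resolution, just the steps described above in order; it assumes a systemd-managed mariadb service, the datadir /var/lib/mysql, the config file /etc/my.cnf.d/galera.cnf, and the pacemaker resource name galera-master, all as shown in the comments. Recovery flags and resource names will vary per deployment, and the surviving cluster members must be healthy.

```
# Sketch of single-node Galera recovery, assembled from the steps above.

# 1. Get mysqld on the failed node to start standalone. If InnoDB refuses
#    to start, a temporary recovery flag (as used in this report) may be
#    needed in /etc/my.cnf.d/galera.cnf:
#      [mysqld]
#      innodb_force_recovery = 1
systemctl restart mariadb

# Repair the corrupted MyISAM system tables locally. SET wsrep_on=OFF keeps
# the repair from being replicated (mysql.* tables are outside Galera anyway).
mysql <<'SQL'
SET wsrep_on=OFF;
REPAIR TABLE mysql.user;
REPAIR TABLE mysql.db;
SQL

# 2. Stop the service and delete grastate.dat so the node unconditionally
#    takes a full state snapshot transfer (SST) from a donor on rejoin.
systemctl stop mariadb
rm -f /var/lib/mysql/grastate.dat
# (Remove innodb_force_recovery from galera.cnf again before restarting.)

# 3. Rejoin the cluster. Under pacemaker, clear the resource's failure
#    state so it is restarted on this node.
crm_resource --cleanup --resource galera-master
```

The last command is the long-option form of the `crm_resource -C -r galera-master` used above.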