Bug 1170376
Summary:            galera cluster silently remains out of sync after bootstrap
Product:            Red Hat Enterprise Linux 7
Component:          resource-agents
Version:            7.2
Status:             CLOSED ERRATA
Severity:           urgent
Priority:           urgent
Reporter:           David Vossel <dvossel>
Assignee:           Fabio Massimo Di Nitto <fdinitto>
QA Contact:         Ofer Blaut <oblaut>
CC:                 agk, ahirshbe, cfeist, cluster-maint, dmaley, fdinitto, jherrman, mbayer, mnovacek, nbarcet, oalbrigt, oblaut, yeylon
Target Milestone:   pre-dev-freeze
Target Release:     7.2
Keywords:           ZStream
Hardware:           Unspecified
OS:                 Unspecified
Fixed In Version:   resource-agents-3.9.5-43.el7
Doc Type:           Bug Fix
Doc Text:
Prior to this update, a Galera cluster in certain circumstances became unsynchronized without displaying any error messages. Consequently, the nodes on the cluster replicated successfully, but the data was not identical. This update adjusts resource-agents to no longer use "read-only" mode with Galera clusters, which ensures that these clusters do not silently fail to synchronize.

Clones:             1242339 (view as bug list)
Last Closed:        2015-11-19 04:40:42 UTC
Type:               Bug
Bug Blocks:         1240394, 1242339
Description (David Vossel, 2014-12-03 22:47:41 UTC)
Ryan and I have been discussing this on IRC. He suggested deleting the /var/lib/mysql/grastate.dat file and seeing what happens when I bring a galera node back online. When I delete the grastate file, the SST occurs and the node coming online is up-to-date with whatever donor it connected to.

-- David

(In reply to David Vossel from comment #1)

This is a potential workaround, but the question remains: why did all 3 nodes have the same state recorded even though they were out of sync?

(In reply to Fabio Massimo Di Nitto from comment #3)

I'm not sure it is a workaround. The grastate.dat file would need to have a seqno of -1 for it to make sense for us to delete it. Before we start galera on one of the nodes that doesn't update, grastate.dat looks like this:

cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 49ddc25c-fd17-11e3-80b9-76596d7aecbb
seqno: 18803526
cert_index:

(In reply to David Vossel from comment #4)

Since the nodes are being shut down cleanly, they record uuid/seqno in grastate.dat successfully. It seems that when they come back online they do not think there is anything to sync. This might have something to do with the fact that the CREATE DATABASE operation is non-transactional. Nothing is being written into these tables. The reason removing the grastate.dat file works is that it forces a full SST (rsync) of the databases, because there is no node state ID to consider.

This does seem odd. I wonder what happens if we try to write to a database that wasn't replicated on all nodes.

Here are some more data points. I ran the following on two nodes. Both nodes are in the galera cluster and both think they are in sync. In reality rhos5-db2 is out of sync, and rhos5-db1 is in sync.
cd /var/lib/mysql/
ls test1000
echo "use test1000;" | mysql
ls test1000
echo "use test1000; CREATE TABLE testtable (id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, myval CHAR(25));" | mysql
ls test1000
echo "use test1000; INSERT INTO testtable (id, myval) VALUES (NULL, 'hooray2');" | mysql
ls test1000

- Writes on out-of-sync node rhos5-db2 (db test1000 doesn't exist):

[root@rhos5-db2 mysql]# ls test1000
ls: cannot access test1000: No such file or directory
[root@rhos5-db2 mysql]# echo "use test1000;" | mysql
ERROR 1049 (42000) at line 1: Unknown database 'test1000'
[root@rhos5-db2 mysql]# ls test1000
ls: cannot access test1000: No such file or directory
[root@rhos5-db2 mysql]# echo "use test1000; CREATE TABLE testtable (id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, myval CHAR(25));" | mysql
ERROR 1049 (42000) at line 1: Unknown database 'test1000'
[root@rhos5-db2 mysql]# ls test1000
ls: cannot access test1000: No such file or directory
[root@rhos5-db2 mysql]# echo "use test1000; INSERT INTO testtable (id, myval) VALUES (NULL, 'hooray2');" | mysql
ERROR 1049 (42000) at line 1: Unknown database 'test1000'
[root@rhos5-db2 mysql]# ls test1000
ls: cannot access test1000: No such file or directory

- Writes on synced node rhos5-db1 (db test1000 does exist):

[root@dvossel-laptop dvossel]# ssh rhos5-db1
Last login: Thu Dec 4 10:16:21 2014 from mrg-07.vmnet.mpc.lab.eng.bos.redhat.com
[root@rhos5-db1 ~]# cd /var/lib/mysql/
[root@rhos5-db1 mysql]# ls test1000
db.opt
[root@rhos5-db1 mysql]# echo "use test1000;" | mysql
[root@rhos5-db1 mysql]# ls test1000
db.opt
[root@rhos5-db1 mysql]# echo "use test1000; CREATE TABLE testtable (id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, myval CHAR(25));" | mysql
[root@rhos5-db1 mysql]# ls test1000
db.opt  testtable.frm
[root@rhos5-db1 mysql]# echo "use test1000; INSERT INTO testtable (id, myval) VALUES (NULL, 'hooray2');" | mysql
[root@rhos5-db1 mysql]# ls test1000
db.opt  testtable.frm

Now here's the funny part. When I wrote to the db on rhos5-db1 successfully, rhos5-db2 crashed. The crash actually causes rhos5-db2 to recover correctly when galera restarts.

-- David

After a long discussion with David, we agree there are two issues.

The first is with the resource agent itself. Because the resource agent will first start mariadb in read-only mode in order to determine the seqno, this causes the seqno in grastate.dat to get reset to -1, which later causes the synchronization problem. David, correct me if I have misstated this. This would explain why we cannot recreate this problem when mariadb is started/stopped manually. I've proposed that the resource agent use 'mysqld --wsrep-recover' as a means to determine a node's position, which would remove the need to start mariadb in read-only mode. This would be an improvement and a potentially much faster way to determine a node's position. David is investigating.

The second issue is why mariadb/galera allows a node that has a valid UUID but seqno set to -1 to join the cluster at all. According to [1] this indicates the node is in a bad state. We've observed that when grastate.dat is in this state, the node will be allowed to join and will report that it is sync'd even though no SST or IST occurs. It seems like the node should be prevented from joining, since it is declared sync'd when it clearly is not (e.g. the databases are not copied, as shown in the description). In addition, if mariadb is stopped after this bogus sync, the seqno will be updated and the node will appear to be in sync going forward, never synchronizing the databases. It appears that the only sane thing to do to prevent this is to delete grastate.dat to force a full SST, so that the node is actually in sync with the cluster.
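The position-detection approach proposed here can be sketched roughly as follows. The "WSREP: Recovered position" log line format is Galera's; the sample log text and the decision logic are illustrative rather than taken from the agent, and the actual mysqld run is replaced by a canned log line so the sketch is self-contained.

```shell
#!/bin/sh
# Sketch: determine a node's position without starting mariadb in read-only
# mode, by parsing the position that `mysqld --wsrep-recover` logs. The
# uuid/seqno here are the values from grastate.dat shown earlier in this bug.
log="141204 10:16:21 mysqld_safe WSREP: Recovered position: 49ddc25c-fd17-11e3-80b9-76596d7aecbb:18803526"
# On a real node, something like (illustrative, not the agent's exact code):
#   log=$(mysqld --wsrep-recover 2>&1 | grep 'WSREP: Recovered position:')

recovered=$(printf '%s\n' "$log" | sed -n 's/.*WSREP: Recovered position: *//p')
uuid=${recovered%:*}     # strip the trailing :seqno
seqno=${recovered##*:}   # keep only the seqno

if [ "$seqno" = "-1" ]; then
    # A recovered seqno of -1 means the position is unknown; per the
    # discussion above, forcing a full SST (e.g. removing grastate.dat)
    # is the only safe recovery.
    echo "position unknown: force full SST"
else
    echo "recovered position: uuid=$uuid seqno=$seqno"
fi
```

This avoids ever opening mariadb for client connections during position detection, which is what resets grastate.dat's seqno to -1 in the read-only approach.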
Would be nice to get clarification on the second issue from partner.

(In reply to Ryan O'Hara from comment #8)

> The first is with the resource agent itself. [...]

This is ultimately the issue I'm concerned about here. The galera cluster happily forms with nodes in various different states, and synchronization never properly occurs. This should never be possible.

> It appears that the only sane thing to do to prevent this is to delete
> grastate.dat to force a full SST [...]

From what I gather, deleting the grastate.dat file means we are anticipating that galera might do something wrong, and attempting to trick galera into doing the right thing. Deleting this file may very well be part of the solution. This sort of logic belongs in the safe_mysqld startup script (which initializes galera), though.

> Would be nice to get clarification on the second issue from partner.

Agreed. The resource-agent optimization/fix is nice, but it doesn't guarantee there are not other ways galera could enter this unsynced state.

-- David

So can this error state be forced by just taking a galera node, putting a good UUID + -1 seqno in grastate.dat and starting? I'd like to just understand that final error state more.

(In reply to Michael Bayer from comment #12)

I don't think it's that simple. Honestly, I'm not sure we should spend much time trying to figure this one out. I thought there might be a simpler way to reproduce this, but the steps in the description are the best I can do.

This bug appears to be an artifact of the galera resource-agent's usage of 'read-only' mode when determining the sequence number. Rather than going down a rabbit hole to figure out why galera is doing this weird thing in our specific use case, let's just modify the resource-agent to no longer use 'read-only' mode. We can use 'mysqld --wsrep-recover' to get the sequence number without launching galera in read-only mode. By doing that, we will bypass whatever is triggering this issue in our setup. It's possible this bug will never even occur if we remove the agent's usage of 'read-only'.

I'm in favor of moving this bug to resource-agents, and I'll just make the bug "go away" for now by changing how the agent works. It's the path of least resistance, and I think the change to the agent is a good idea regardless. If we can manage to reproduce this issue after the resource-agent change, then we should re-approach an actual galera fix.

-- David

OK, let's try that.

I've made the necessary changes to the galera agent to remove usage of 'read-only' for the sequence number retrieval during bootstrap:

https://github.com/ClusterLabs/resource-agents/pull/597

Using the patch linked above, I can no longer reproduce the issue of nodes "silently becoming out of sync". Using the same set of steps from the description, all the nodes now sync properly.

-- David

I have verified that the rabbitmq-cluster agent is present in resource-agents (resource-agents-3.9.5-50.el7.x86_64) and looks sane.

Setting OtherQA according to comment #18.

Verified:
using: 7.0-RHEL-7-director/2015-10-01.1
resource-agents-3.9.5-52.el7.x86_64 # wasn't included in the puddle #
galera-25.3.5-7.el7ost.x86_64
mariadb-galera-common-5.5.42-1.el7ost.x86_64
mariadb-galera-server-5.5.42-1.el7ost.x86_64
reboot -f for controller1/2
[root@overcloud-controller-0 ~]# for i in $(seq 1 2000); do mysql mysql -e "CREATE
> DATABASE test$i;"; done
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
...
...
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
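The errors above fall into two classes: a one-off deadlock (1213) and repeated "WSREP has not yet prepared node" errors (1047) while the node is not yet Synced. A minimal sketch of telling them apart, with the error numbers taken from the transcript and the handling policy purely illustrative:

```shell
#!/bin/sh
# Sketch: classify the MySQL client errors seen while galera is recovering.
# Error codes 1213 and 1047 are the ones in the transcript above; the
# suggested handling is an assumption, not taken from the resource agent.
classify() {
    case "$1" in
        1213) echo "deadlock: transaction aborted, safe to retry" ;;
        1047) echo "WSREP not prepared: wait for the node to reach Synced" ;;
        *)    echo "unhandled error $1" ;;
    esac
}

classify 1213
classify 1047
```

In practice this means the CREATE DATABASE loop simply fails for every statement issued before the node finishes (re)joining the cluster, which is why only some of the test databases exist before galera is fully up.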
before galera is up and running:
[root@overcloud-controller-0 mysql]# ls -lasd test* | wc -l
57
[heat-admin@overcloud-controller-1 mysql]$ ls -lasd test* | wc -l
57
[heat-admin@overcloud-controller-2 mysql]$ ls -lasd test* | wc -l
1
after:
Master/Slave Set: galera-master [galera]
Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
[root@overcloud-controller-0 mysql]# ls -lasd test* | wc -l
57
[root@overcloud-controller-1 mysql]# ls -lasd test* | wc -l
57
[root@overcloud-controller-2 mysql]# ls -lasd test* | wc -l
57
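The before/after counts can be compared mechanically. A minimal sketch, with node names and counts hard-coded from the "before" state above; on live nodes each count would come from something like `ls -d /var/lib/mysql/test* | wc -l`:

```shell
#!/bin/sh
# Sketch: flag any node whose test* database-directory count diverges from
# the first node's. Sample counts are the "before" values from this report.
set -- "controller-0:57" "controller-1:57" "controller-2:1"

first=
status="in sync"
for entry in "$@"; do
    node=${entry%%:*}
    count=${entry##*:}
    [ -z "$first" ] && first=$count
    if [ "$count" != "$first" ]; then
        echo "$node diverges: $count vs $first"
        status="OUT OF SYNC"
    fi
done
echo "cluster: $status"    # prints "cluster: OUT OF SYNC" for these counts
```

After the fixed agent forces a proper SST, all three counts match (57/57/57) and a check like this reports the cluster in sync.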
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2190.html