Bug 1170376
Summary:            galera cluster silently remains out of sync after bootstrap
Product:            Red Hat Enterprise Linux 7
Component:          resource-agents
Version:            7.2
Status:             CLOSED ERRATA
Severity:           urgent
Priority:           urgent
Reporter:           David Vossel <dvossel>
Assignee:           Fabio Massimo Di Nitto <fdinitto>
QA Contact:         Ofer Blaut <oblaut>
CC:                 agk, ahirshbe, cfeist, cluster-maint, dmaley, fdinitto, jherrman, mbayer, mnovacek, nbarcet, oalbrigt, oblaut, yeylon
Target Milestone:   pre-dev-freeze
Target Release:     7.2
Keywords:           ZStream
Hardware:           Unspecified
OS:                 Unspecified
Fixed In Version:   resource-agents-3.9.5-43.el7
Doc Type:           Bug Fix
Doc Text:
Prior to this update, a Galera cluster in certain circumstances became unsynchronized without displaying any error messages. Consequently, the nodes on the cluster replicated successfully, but the data was not identical. This update adjusts resource-agents to no longer use "read-only" mode with Galera clusters, which ensures that these clusters do not silently fail to synchronize.

Clones:             1242339 (view as bug list)
Last Closed:        2015-11-19 04:40:42 UTC
Type:               Bug
Bug Blocks:         1240394, 1242339
Description (David Vossel, 2014-12-03 22:47:41 UTC)
Ryan and I have been discussing this on IRC. He suggested deleting the /var/lib/mysql/grastate.dat file and seeing what happens when I bring a galera node back online. When I delete the grastate file, the SST occurs and the node coming online is up-to-date with whatever donor it connected to.

-- David

(In reply to David Vossel from comment #1)

This is a potential workaround, but the question remains: why did all 3 nodes have the same state recorded even though they were out of sync?

(In reply to Fabio Massimo Di Nitto from comment #3)

I'm not sure it is a workaround. The grastate.dat file would need to have a seqno of -1 for it to make sense for us to delete it. Before we start galera on one of the nodes that doesn't update, grastate.dat looks like this:

cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 49ddc25c-fd17-11e3-80b9-76596d7aecbb
seqno: 18803526
cert_index:

(In reply to David Vossel from comment #4)

Since the nodes are being shut down cleanly, they record uuid/seqno in grastate.dat successfully. It seems that when they come back online they do not think there is anything to sync. This might have something to do with the fact that the CREATE DATABASE operation is non-transactional. Nothing is being written into these tables. The reason removing the grastate.dat file works is that it forces a full SST (rsync) of the databases, because there is no node state ID to consider.

This does seem odd. I wonder what happens if we try to write to a database that wasn't replicated on all nodes.

Here are some more data points. I ran the following on two nodes. Both nodes are in the galera cluster and both think they are in sync. In reality rhos5-db2 is out of sync, and rhos5-db1 is in sync.
cd /var/lib/mysql/
ls test1000
echo "use test1000;" | mysql
ls test1000
echo "use test1000; CREATE TABLE testtable (id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, myval CHAR(25));" | mysql
ls test1000
echo "use test1000; INSERT INTO testtable (id, myval) VALUES (NULL, 'hooray2');" | mysql
ls test1000

- Writes on out-of-sync node rhos5-db2 (db test1000 doesn't exist):

[root@rhos5-db2 mysql]# ls test1000
ls: cannot access test1000: No such file or directory
[root@rhos5-db2 mysql]# echo "use test1000;" | mysql
ERROR 1049 (42000) at line 1: Unknown database 'test1000'
[root@rhos5-db2 mysql]# ls test1000
ls: cannot access test1000: No such file or directory
[root@rhos5-db2 mysql]# echo "use test1000; CREATE TABLE testtable (id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, myval CHAR(25));" | mysql
ERROR 1049 (42000) at line 1: Unknown database 'test1000'
[root@rhos5-db2 mysql]# ls test1000
ls: cannot access test1000: No such file or directory
[root@rhos5-db2 mysql]# echo "use test1000; INSERT INTO testtable (id, myval) VALUES (NULL, 'hooray2');" | mysql
ERROR 1049 (42000) at line 1: Unknown database 'test1000'
[root@rhos5-db2 mysql]# ls test1000
ls: cannot access test1000: No such file or directory

- Writes on synced node rhos5-db1 (db test1000 does exist):

[root@dvossel-laptop dvossel]# ssh rhos5-db1
Last login: Thu Dec 4 10:16:21 2014 from mrg-07.vmnet.mpc.lab.eng.bos.redhat.com
[root@rhos5-db1 ~]# cd /var/lib/mysql/
[root@rhos5-db1 mysql]# ls test1000
db.opt
[root@rhos5-db1 mysql]# echo "use test1000;" | mysql
[root@rhos5-db1 mysql]# ls test1000
db.opt
[root@rhos5-db1 mysql]# echo "use test1000; CREATE TABLE testtable (id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, myval CHAR(25));" | mysql
[root@rhos5-db1 mysql]# ls test1000
db.opt  testtable.frm
[root@rhos5-db1 mysql]# echo "use test1000; INSERT INTO testtable (id, myval) VALUES (NULL, 'hooray2');" | mysql
[root@rhos5-db1 mysql]# ls test1000
db.opt  testtable.frm

Now here's the funny part. When I wrote to the db on rhos5-db1 successfully, rhos5-db2 crashed. The crash actually causes rhos5-db2 to recover correctly when galera restarts.

-- David

After a long discussion with David, we agree there are two issues.

The first is with the resource agent itself. Because the resource agent will first start mariadb in read-only mode in order to determine the seqno, this causes the seqno in grastate.dat to get reset to -1, which later causes the synchronization problem. David, correct me if I have misstated this. This would explain why we cannot recreate this problem when mariadb is started/stopped manually. I've proposed that the resource agent use 'mysqld --wsrep-recover' as a means to determine a node's position, which would remove the need to start mariadb in read-only mode. This would be an improvement and a potentially much faster way to determine a node's position. David is investigating.

The second issue is why mariadb/galera allows a node that has a valid UUID but seqno set to -1 to join the cluster at all. According to [1] this indicates the node is in a bad state. We've observed that when grastate.dat is in this state, the node will be allowed to join and will report that it is sync'd even though no SST or IST occurs. It seems like the node should be prevented from joining, since it is declared sync'd when it clearly is not (e.g. the databases are not copied, as shown in the description). In addition, if mariadb is stopped after this bogus sync, the seqno will be updated and the node will appear to be in sync going forward, never synchronizing the databases. It appears that the only sane thing to do to prevent this is to delete grastate.dat to force a full SST, so that the node is actually in sync with the cluster.
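The position-detection approach proposed here can be sketched roughly as follows. The "WSREP: Recovered position" log line format is Galera's; the sample log text and the decision logic are illustrative rather than taken from the agent, and the actual mysqld run is replaced by a canned log line so the sketch is self-contained.

```shell
#!/bin/sh
# Sketch: determine a node's position without starting mariadb in read-only
# mode, by parsing the position that `mysqld --wsrep-recover` logs. The
# uuid/seqno here are the values from grastate.dat shown earlier in this bug.
log="141204 10:16:21 mysqld_safe WSREP: Recovered position: 49ddc25c-fd17-11e3-80b9-76596d7aecbb:18803526"
# On a real node, something like (illustrative, not the agent's exact code):
#   log=$(mysqld --wsrep-recover 2>&1 | grep 'WSREP: Recovered position:')

recovered=$(printf '%s\n' "$log" | sed -n 's/.*WSREP: Recovered position: *//p')
uuid=${recovered%:*}     # strip the trailing :seqno
seqno=${recovered##*:}   # keep only the seqno

if [ "$seqno" = "-1" ]; then
    # A recovered seqno of -1 means the position is unknown; per the
    # discussion above, forcing a full SST (e.g. removing grastate.dat)
    # is the only safe recovery.
    echo "position unknown: force full SST"
else
    echo "recovered position: uuid=$uuid seqno=$seqno"
fi
```

This avoids ever opening mariadb for client connections during position detection, which is what resets grastate.dat's seqno to -1 in the read-only approach.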
Would be nice to get clarification on the second issue from partner.

(In reply to Ryan O'Hara from comment #8)

> The first is with the resource agent itself. [...]

This is ultimately the issue I'm concerned about here. The galera cluster happily forms with nodes in various different states, and synchronization never properly occurs. This should never be possible.

> It appears that the only sane thing to do to prevent this is to delete
> grastate.dat to force a full SST [...]

From what I gather, deleting the grastate.dat file means we are anticipating that galera might do something wrong, and attempting to trick galera into doing the right thing. Deleting this file may very well be part of the solution. This sort of logic belongs in the safe_mysqld startup script (which initializes galera), though.

> Would be nice to get clarification on the second issue from partner.

Agreed. The resource-agent optimization/fix is nice, but it doesn't guarantee there are not other ways galera could enter this unsynced state.

-- David

So can this error state be forced by just taking a galera node, putting a good UUID + -1 seqno in grastate.dat and starting? I'd like to just understand that final error state more.

(In reply to Michael Bayer from comment #12)

I don't think it's that simple. Honestly, I'm not sure we should spend much time trying to figure this one out. I thought there might be a simpler way to reproduce this, but the steps in the description are the best I can do.

This bug appears to be an artifact of the galera resource-agent's usage of 'read-only' mode when determining the sequence number. Rather than going down a rabbit hole to figure out why galera is doing this weird thing in our specific use case, let's just modify the resource-agent to no longer use 'read-only' mode. We can use 'mysqld --wsrep-recover' to get the sequence number without launching galera in read-only mode. By doing that, we will bypass whatever is triggering this issue in our setup. It's possible this bug will never even occur if we remove the agent's usage of 'read-only'.

I'm in favor of moving this bug to resource-agents, and I'll just make the bug "go away" for now by changing how the agent works. It's the path of least resistance, and I think the change to the agent is a good idea regardless. If we can manage to reproduce this issue after the resource-agent change, then we should re-approach an actual galera fix.

-- David

OK, let's try that.

I've made the necessary changes to the galera agent to remove usage of 'read-only' for the sequence number retrieval during bootstrap:

https://github.com/ClusterLabs/resource-agents/pull/597

Using the patch linked above, I can no longer reproduce the issue of nodes "silently becoming out of sync". Using the same set of steps from the description, all the nodes now sync properly.

-- David

I have verified that the rabbitmq-cluster agent is present in resource-agents (resource-agents-3.9.5-50.el7.x86_64) and looks sane.

Setting OtherQA according to comment #18.

Verified:
using: 7.0-RHEL-7-director/2015-10-01.1
resource-agents-3.9.5-52.el7.x86_64 # wasn't included in the puddle #
galera-25.3.5-7.el7ost.x86_64
mariadb-galera-common-5.5.42-1.el7ost.x86_64
mariadb-galera-server-5.5.42-1.el7ost.x86_64
reboot -f for controller1/2
[root@overcloud-controller-0 ~]# for i in $(seq 1 2000); do mysql mysql -e "CREATE
> DATABASE test$i;"; done
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
...
...
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
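The errors above fall into two classes: a one-off deadlock (1213) and repeated "WSREP has not yet prepared node" errors (1047) while the node is not yet Synced. A minimal sketch of telling them apart, with the error numbers taken from the transcript and the handling policy purely illustrative:

```shell
#!/bin/sh
# Sketch: classify the MySQL client errors seen while galera is recovering.
# Error codes 1213 and 1047 are the ones in the transcript above; the
# suggested handling is an assumption, not taken from the resource agent.
classify() {
    case "$1" in
        1213) echo "deadlock: transaction aborted, safe to retry" ;;
        1047) echo "WSREP not prepared: wait for the node to reach Synced" ;;
        *)    echo "unhandled error $1" ;;
    esac
}

classify 1213
classify 1047
```

In practice this means the CREATE DATABASE loop simply fails for every statement issued before the node finishes (re)joining the cluster, which is why only some of the test databases exist before galera is fully up.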
before galera is up and running:
[root@overcloud-controller-0 mysql]# ls -lasd test* | wc -l
57
[heat-admin@overcloud-controller-1 mysql]$ ls -lasd test* | wc -l
57
[heat-admin@overcloud-controller-2 mysql]$ ls -lasd test* | wc -l
1
after:
Master/Slave Set: galera-master [galera]
Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
[root@overcloud-controller-0 mysql]# ls -lasd test* | wc -l
57
[root@overcloud-controller-1 mysql]# ls -lasd test* | wc -l
57
[root@overcloud-controller-2 mysql]# ls -lasd test* | wc -l
57
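The before/after counts can be compared mechanically. A minimal sketch, with node names and counts hard-coded from the "before" state above; on live nodes each count would come from something like `ls -d /var/lib/mysql/test* | wc -l`:

```shell
#!/bin/sh
# Sketch: flag any node whose test* database-directory count diverges from
# the first node's. Sample counts are the "before" values from this report.
set -- "controller-0:57" "controller-1:57" "controller-2:1"

first=
status="in sync"
for entry in "$@"; do
    node=${entry%%:*}
    count=${entry##*:}
    [ -z "$first" ] && first=$count
    if [ "$count" != "$first" ]; then
        echo "$node diverges: $count vs $first"
        status="OUT OF SYNC"
    fi
done
echo "cluster: $status"    # prints "cluster: OUT OF SYNC" for these counts
```

After the fixed agent forces a proper SST, all three counts match (57/57/57) and a check like this reports the cluster in sync.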
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2190.html