Bug 1242339
Summary: Galera cluster silently remains out of sync after bootstrap.
Product: Red Hat Enterprise Linux 7 | Reporter: Libor Miksik <lmiksik>
Component: resource-agents | Assignee: David Vossel <dvossel>
Status: CLOSED ERRATA | QA Contact: Asaf Hirshberg <ahirshbe>
Severity: urgent | Docs Contact:
Priority: urgent
Version: 7.2 | CC: agk, cfeist, cluster-maint, djansa, dmaley, dvossel, fdinitto, jherrman, jshortt, mbayer, mnovacek, nbarcet, oblaut, yeylon
Target Milestone: rc | Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: resource-agents-3.9.5-40.el7_1.6 | Doc Type: Bug Fix
Doc Text:
Prior to this update, a Galera cluster in certain circumstances became unsynchronized without displaying any error messages. Consequently, the nodes on the cluster replicated successfully, but the data was not identical. This update adjusts resource-agents to no longer use "read-only" mode with Galera clusters, which ensures that these clusters do not silently fail to synchronize.
Story Points: ---
Clone Of: 1170376 | Environment:
Last Closed: 2015-08-05 18:29:22 UTC | Type: ---
Regression: --- | Mount Type: ---
Documentation: --- | CRM:
Verified Versions: | Category: ---
oVirt Team: --- | RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- | Target Upstream Version:
Embargoed:
Bug Depends On: 1170376
Bug Blocks:
Attachments:
Description (Libor Miksik, 2015-07-13 06:45:37 UTC)
After following the steps by David in bug 1242339 on my osp-director HA environment:

1) On one controller, run:
   for i in $(seq 1 2000); do mysql mysql -e "CREATE DATABASE test$i;"; done
2) Boot the other 2 controllers.
3) Wait for the error from controller one.
4) When the 2 controllers boot up, check before galera is started:
   cd /var/lib/mysql/; ls -lasd test* | wc -l
5) Wait for master election with crm_mon.
6) Run ls -lasd test* | wc -l again.

The results were a bit different:

1) Output of cd /var/lib/mysql/; ls -lasd test* | wc -l:

[root@overcloud-controller-2 mysql]# ls -lasd test* | wc -l
404
[heat-admin@overcloud-controller-1 ~]$ cd /var/lib/mysql/; ls -lasd test* | wc -l
252
[heat-admin@overcloud-controller-0 ~]$ cd /var/lib/mysql/; ls -lasd test* | wc -l
1

2) Output of pcs status:

Master/Slave Set: galera-master [galera]
    Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
    Stopped: [ overcloud-controller-0 ]

The controller "overcloud-controller-0" is master, but galera is down, and when trying to start it I get:

Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details.

3)
MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 0     |
+--------------------+-------+

4) I tried to reboot controller 0, but it kept the galera master role.

Master/Slave Set: galera-master [galera]
    galera (ocf::heartbeat:galera): FAILED overcloud-controller-0

Created attachment 1054817 [details]
rhos-director ha environment failed galera on controller0 mysql logs
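The per-node count of test* databases is the ad hoc divergence check used throughout this bug. A minimal sketch of that check as a reusable function (the default datadir is the stock MariaDB location and is an assumption about your deployment):

```shell
# Sketch: count the test* databases in a MariaDB datadir, as done by
# hand above with "ls -lasd test* | wc -l". Matching counts on every
# node before Galera starts are a necessary (not sufficient) sign that
# the datadirs agree. The default path is the stock datadir; adjust if
# your deployment differs.
count_test_dbs() {
    datadir="${1:-/var/lib/mysql}"
    # each database created by the flood loop is a directory test<N>
    ls -d "$datadir"/test* 2>/dev/null | wc -l
}

# Typical use on one controller:
# count_test_dbs /var/lib/mysql
```

In the failing run above, the three controllers reported 404, 252, and 1, which is exactly the divergence this function makes easy to script.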
None of this looks right. In the sos report I see this log:

"galera_start_0:4413:stderr [ ocf-exit-reason:Slave instance did not start correctly in read-only mode"

That indicates that the new resource-agents package is not in use. This read-only logic doesn't exist in the package now, so this log message shouldn't be seen.

Besides that, I can't identify what exactly is going on here. I see you mention things like "Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details." Systemd isn't (or shouldn't be) involved at any point with galera in our deployment. If it is, then that is incorrect and will cause problems.

In the original description of the steps used, item 2 says "2) boot the other 2 controllers." I don't know what that means. Were the other 2 nodes not booted before the test began? Were they rebooted?

Also, there are a lot of logs to parse through in the sos reports you provided outside of this issue. Please provide a general starting timestamp associated with the steps you performed; otherwise it is difficult to sort through all the noise.

Thanks,
-- David

I just followed your steps from bug 1170376, where you also used systemctl (maybe unintentionally):

# service mariadb start
Redirecting to /bin/systemctl start mariadb.service

1) Could you please name the package that shouldn't be used now?
2) Can you provide a more specific how-to, so I could try to reproduce it?

David Vossel 2014-12-03 17:47:41 EST

Description of problem:

We have a test case where a galera cluster appears to get into a state where all the nodes think they are in sync, but they are in fact not. The result is that all the nodes are happily replicating, but not all the data is the same.
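Whether a node is still running the old agent can be checked directly, since the read-only startup logic mentioned above was removed from the galera agent in the fixed build. A minimal sketch, assuming the standard OCF heartbeat agent path (the path is an assumption and may differ on your system):

```shell
# Sketch: flag a galera resource agent file that still carries the old
# "read-only" startup logic (removed in resource-agents-3.9.5-40.el7_1.6).
# The default path is the usual OCF location for heartbeat agents; this
# is an assumption, not taken from the bug report.
check_galera_agent() {
    agent="${1:-/usr/lib/ocf/resource.d/heartbeat/galera}"
    if [ ! -r "$agent" ]; then
        echo "agent not found: $agent"
        return 0
    fi
    if grep -q 'read-only' "$agent"; then
        echo "OLD agent: read-only logic still present"
    else
        echo "patched agent: no read-only logic"
    fi
}
```

On a node logging "Slave instance did not start correctly in read-only mode", this should report the old agent, i.e. the updated package was not actually installed.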
Version-Release number of selected component (if applicable):

galera.x86_64 25.3.5-5.el7ost @OpenStack-5.0-RHEL-7-Puddle
mariadb-galera-common.x86_64 1:5.5.37-7.el7ost @OpenStack-5.0-RHEL-7-Puddle
mariadb-galera-server.x86_64 1:5.5.37-7.el7ost @OpenStack-5.0-RHEL-7-Puddle

How reproducible:

We have a deployment that reproduces this 100% of the time. I am unsure how reliably it can be reproduced outside of this environment. This could be timing related.

Steps to Reproduce:

TEST: Force reboot 2 out of 3 nodes while the DB is flooded:

[root@rhos5-db3 ~]# for i in $(seq 1 2000); do mysql mysql -e "CREATE DATABASE test$i;"; done

Expected errors due to galera dropping internal quorum:

ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock; try restarting transaction
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 104
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
.....

BEFORE galera is up and running on the nodes, check:

cd /var/lib/mysql/
ls -lasd test* | wc -l

[root@rhos5-db3 mysql]# ls -lasd test* | wc -l
168
[root@rhos5-db1 mysql]# ls -lasd test* | wc -l
1
[root@rhos5-db2 mysql]# ls -lasd test* | wc -l
86

Wait for master election with crm_mon; the first master is rhos5-db3 (as expected, since it has the latest and greatest table). Wait for full master election:

Master/Slave Set: galera-master [galera]
    Masters: [ rhos5-db1 rhos5-db2 rhos5-db3 ]

From here, it appears all the nodes are out of sync. I investigated this more and found that I experienced the same behavior when manually bootstrapping the nodes using the systemd service.
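The wsrep status variables quoted in this bug (wsrep_cluster_size above) are the standard way to inspect a node's view of the cluster. A small sketch for pulling one value out of the tab-separated output of a non-interactive mysql query; the mysql invocation itself is shown only as a comment since it needs a live node:

```shell
# Sketch: extract one wsrep status value from the tab-separated output
# of "mysql -N -e" (column names suppressed), e.g. to script the
# wsrep_cluster_size check shown earlier. A healthy 3-node cluster
# reports 3; the failed controller in this bug reported 0.
wsrep_value() {
    # $1: variable name; reads "name<TAB>value" lines on stdin
    awk -v k="$1" '$1 == k { print $2 }'
}

# On a live node (assumes local root mysql access, as in the reproducer):
# mysql -N -e "SHOW STATUS LIKE 'wsrep_%'" | wsrep_value wsrep_cluster_size
```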
Example: All galera instances are down. Bootstrapping using node rhos5-db3:

- INITIATE BOOTSTRAP ON rhos5-db3:
# service mariadb start
Redirecting to /bin/systemctl start mariadb.service
# ls -lasd test* | wc -l
1702

- START GALERA ON rhos5-db2 and rhos5-db1:
# service mariadb start
Redirecting to /bin/systemctl start mariadb.service
# ls -lasd test* | wc -l
134
# service mariadb start
Redirecting to /bin/systemctl start mariadb.service
# ls -lasd test* | wc -l
810

- RESULT: all nodes remain out of sync, regardless of whether they were manually bootstrapped using systemd or started by the galera resource agent.

QE tested against the incorrect version of resource-agents.

Verified. Steps to reproduce:

1) pcs cluster stop --all
2) On all controllers:
   rpm -Uvh http://download.devel.redhat.com/brewroot/packages/resource-agents/3.9.5/40.el7_1.6/x86_64/resource-agents-3.9.5-40.el7_1.6.x86_64.rpm
3) pcs cluster start --all
4) for i in $(seq 1 2000); do mysql mysql -e "CREATE DATABASE test$i;"; done
5) Reboot two controllers.
6) Check with: cd /var/lib/mysql/; ls -lasd test* | wc -l

[root@overcloud-controller-0 mysql]# cd /var/lib/mysql/; ls -lasd test* | wc -l
549
[root@overcloud-controller-1 mysql]# cd /var/lib/mysql/; ls -lasd test* | wc -l
549
[root@overcloud-controller-2 mysql]# cd /var/lib/mysql/; ls -lasd test* | wc -l
549

Verified by asaf on RHEL 7.1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1557.html
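The QE verification in this bug reduces to "every node reports the same test-db count". A trivial sketch of that comparison, taking the per-node counts as arguments (how you collect them from each controller, e.g. over ssh, is left out):

```shell
# Sketch: report whether every per-node count matches the first one.
# In the verified run all three controllers reported 549; in the
# failing runs the counts diverged (e.g. 404 / 252 / 1).
counts_in_sync() {
    first="$1"
    for c in "$@"; do
        if [ "$c" -ne "$first" ]; then
            echo "OUT OF SYNC"
            return 0
        fi
    done
    echo "IN SYNC"
}

# Example with the counts from the verified run:
# counts_in_sync 549 549 549
```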