Bug 1242339 - galera cluster silently remains out of sync after bootstrap.
Summary: galera cluster silently remains out of sync after bootstrap.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: David Vossel
QA Contact: Asaf Hirshberg
URL:
Whiteboard:
Depends On: 1170376
Blocks:
 
Reported: 2015-07-13 06:45 UTC by Libor Miksik
Modified: 2019-08-15 04:51 UTC
CC List: 14 users

Fixed In Version: resource-agents-3.9.5-40.el7_1.6
Doc Type: Bug Fix
Doc Text:
Prior to this update, a Galera cluster in certain circumstances became unsynchronized without displaying any error messages. Consequently, the nodes on the cluster replicated successfully, but the data was not identical. This update adjusts resource-agents to no longer use "read-only" mode with Galera clusters, which ensures that these clusters do not silently fail to synchronize.
Clone Of: 1170376
Environment:
Last Closed: 2015-08-05 18:29:22 UTC
Target Upstream Version:
Embargoed:


Attachments
rhos-director ha environment failed galera on controller0 mysql logs (102.46 KB, text/plain), 2015-07-22 12:01 UTC, Asaf Hirshberg


Links
Red Hat Product Errata RHBA-2015:1557 (normal, SHIPPED_LIVE): resource-agents bug fix and enhancement update, last updated 2015-08-05 22:29:07 UTC

Description Libor Miksik 2015-07-13 06:45:37 UTC
This bug has been copied from bug #1170376 and has been proposed
to be backported to 7.1 z-stream (EUS).

Comment 5 Asaf Hirshberg 2015-07-22 11:59:22 UTC
After following David's steps from bug 1242339 on my osp-director HA environment:

1) on one controller run: for i in $(seq 1 2000); do mysql mysql -e "CREATE
     DATABASE test$i;"; done
2) boot the other 2 controllers.
3) wait for error from controller one.
4) when the 2 controllers boot up, check before galera is started: 
     cd /var/lib/mysql/ ; ls -lasd test* | wc -l
5) wait for master election with crm_mon (see the polling sketch below)
6) run ls -lasd test* | wc -l again.
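
For step 5, the election can also be polled non-interactively; this is only a rough sketch ("crm_mon -1" prints the cluster status once and exits):

# wait until the galera-master resource reports at least one promoted node
until crm_mon -1 | grep -q 'Masters:'; do
    sleep 1
done
crm_mon -1 | grep -A2 'galera-master'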

The results were a bit different:
1) output of cd /var/lib/mysql/ ; ls -lasd test* | wc -l:
[root@overcloud-controller-2 mysql]# ls -lasd test* | wc -l
404
[heat-admin@overcloud-controller-1 ~]$ cd /var/lib/mysql/;ls -lasd test* | wc -l
252
[heat-admin@overcloud-controller-0 ~]$ cd /var/lib/mysql/;ls -lasd test* | wc -l
1

2) output of pcs status: 
 Master/Slave Set: galera-master [galera]
     Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]

* The controller "overcloud-controller-0" is master, but galera is down, and when trying to start it I get:
Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details.

3) SHOW STATUS output (further wsrep status checks are sketched below):
MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 0     |
+--------------------+-------+

4) I tried to reboot controller 0 but it kept the galera master role.
Master/Slave Set: galera-master [galera]
     galera	(ocf::heartbeat:galera):	FAILED overcloud-controller-0
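
Besides wsrep_cluster_size, a few other standard wsrep status variables show whether a node considers itself part of a synced primary component; a quick check could look like this (nothing here is specific to this bug):

# a healthy cluster member reports Primary / Synced / ON
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
mysql -e "SHOW STATUS LIKE 'wsrep_ready';"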

Comment 6 Asaf Hirshberg 2015-07-22 12:01:53 UTC
Created attachment 1054817 [details]
rhos-director ha environment failed galera on controller0 mysql logs

Comment 7 David Vossel 2015-07-22 17:33:19 UTC
None of this looks right.

In the sos report I see this log.

"galera_start_0:4413:stderr [ ocf-exit-reason:Slave instance did not start correctly in read-only mode"

That indicates that the new resource-agents package is not in use. This read-only logic doesn't exist in the package now. This log message shouldn't be seen.
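
A quick way to confirm which resource-agents build is installed, and whether the galera agent still carries the old read-only logic, could look like this (a sketch only; the path below is the usual location of the ocf:heartbeat:galera agent):

# the fix is delivered in resource-agents-3.9.5-40.el7_1.6 (see "Fixed In Version" above)
rpm -q resource-agents
# the updated agent should no longer reference read-only mode
grep -n 'read.only' /usr/lib/ocf/resource.d/heartbeat/galera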

Besides that, I can't identify what exactly is going on here. I see you mention things like "Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details." Systemd isn't (or shouldn't be) involved at any point with galera in our deployment. If it is, then that is incorrect and will cause problems.
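
To rule out systemd interference, it may be worth checking that mariadb.service is neither enabled nor running and that galera only appears as a pacemaker resource (a hedged sketch):

systemctl is-enabled mariadb.service   # expected: disabled
systemctl is-active mariadb.service    # expected: inactive
pcs status resources                   # galera should only show up under the galera-master set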

In the original description of the steps used, item 2 says "2) boot the other 2 controllers." I don't know what that means. Were the other 2 nodes not booted before the test began? Were they rebooted?

Also, there are a lot of logs to parse through in the sos reports you provided outside of this issue. Please provide a general starting timestamp for the steps you performed; otherwise it is difficult to sort through all the noise.

Thanks,
-- David

Comment 8 Asaf Hirshberg 2015-07-23 05:04:43 UTC
I just followed your steps from bug 1170376, where you also used systemctl (maybe unintentionally):
# service mariadb start
Redirecting to /bin/systemctl start mariadb.service

1) Could you please name the package that shouldn't be used now?
2) Can you provide a more specific how-to, so I could try to reproduce it?
   


(Original description, copied from bug 1170376) David Vossel 2014-12-03 17:47:41 EST

Description of problem:

We have a test case where a galera cluster appears to get into a state where all the nodes think they are in sync, but they are in fact not. The result is that all the nodes are happily replicating, but not all the data is the same.


Version-Release number of selected component (if applicable):

galera.x86_64                      25.3.5-5.el7ost               @OpenStack-5.0-RHEL-7-Puddle
mariadb-galera-common.x86_64       1:5.5.37-7.el7ost             @OpenStack-5.0-RHEL-7-Puddle
mariadb-galera-server.x86_64       1:5.5.37-7.el7ost             @OpenStack-5.0-RHEL-7-Puddle

How reproducible:
We have a deployment that reproduces this 100% of the time. I am unsure how reliably it will reproduce outside of that environment. This could be timing related.

Steps to Reproduce:

TEST

Force reboot 2 out of 3 nodes while DB is flooded:


[root@rhos5-db3 ~]# for i in $(seq 1 2000); do mysql mysql -e "CREATE
DATABASE test$i;"; done
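
The report does not say how the two nodes were force rebooted; one common way (an assumption on my part, not taken from this report) is the sysrq trigger, which reboots a node immediately without any clean shutdown:

# on each of the 2 nodes to be killed
echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
echo b > /proc/sysrq-trigger       # immediate reboot, no sync, no unmount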

EXPECTED ERRORS DUE TO GALERA DROPPING INTERNAL QUORUM:

[root@rhos5-db3 ~]# for i in $(seq 1 2000); do mysql mysql -e "CREATE
DATABASE test$i;"; done
ERROR 1213 (40001) at line 1: Deadlock found when trying to get lock;
try restarting transaction
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial
communication packet', system error: 104
ERROR 2002 (HY000): Can't connect to local MySQL server through socket
'/var/lib/mysql/mysql.sock' (2)
.....

BEFORE galera is up and running on the nodes check:

cd /var/lib/mysql/
ls -lasd test* | wc -l

[root@rhos5-db3 mysql]# ls -lasd test* | wc -l
168

[root@rhos5-db1 mysql]# ls -lasd test* | wc -l
1

[root@rhos5-db2 mysql]# ls -lasd test* | wc -l
86

Wait for master election with crm_mon; the first master is rhos5-db3 (as expected, since it has the latest and greatest table).

Wait for full master election:

 Master/Slave Set: galera-master [galera]
     Masters: [ rhos5-db1 rhos5-db2 rhos5-db3 ]

From here, it appears all the nodes are out of sync.  I investigated this more and found that I experienced the same behavior when manually bootstrapping the nodes using the systemd service.

Example: All galera instances are down, Bootstrapping using node rhos5-db3

- INITIATE BOOTSTRAP ON rhos5-db3

# service mariadb start
Redirecting to /bin/systemctl start  mariadb.service
# ls -lasd test* | wc -l
1702

- START GALERA ON rhos5-db2 and rhos5-db1

# service mariadb start
Redirecting to /bin/systemctl start  mariadb.service
# ls -lasd test* | wc -l
134

# service mariadb start
Redirecting to /bin/systemctl start  mariadb.service
# ls -lasd test* | wc -l
810

- RESULT: all nodes remain out of sync, regardless of whether they were manually bootstrapped using systemd or started by the galera resource agent.
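
One way to compare the three nodes from a single host after the test could look like the following; this is only a sketch, using the hostnames above and assuming passwordless ssh and local mysql access on each node:

# identical checksums mean the nodes at least agree on the set of test databases
for h in rhos5-db1 rhos5-db2 rhos5-db3; do
    echo -n "$h: "
    ssh "$h" "mysql -N -e \"SHOW DATABASES LIKE 'test%'\" | sort | md5sum"
done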

Comment 10 Fabio Massimo Di Nitto 2015-07-23 15:44:59 UTC
QE tested against the incorrect version of resource-agents.

Comment 12 Asaf Hirshberg 2015-07-27 07:47:18 UTC
Verified.

Steps to reproduce:
1) pcs cluster stop --all

2) on all controllers: rpm -Uvh http://download.devel.redhat.com/brewroot/packages/resource-agents/3.9.5/40.el7_1.6/x86_64/resource-agents-3.9.5-40.el7_1.6.x86_64.rpm

3) pcs cluster start --all
4) for i in $(seq 1 2000); do mysql mysql -e "CREATE
   DATABASE test$i;"; done
5) reboot two controllers.
6) check with: cd /var/lib/mysql/; ls -lasd test* | wc -l

[root@overcloud-controller-0 mysql]# cd /var/lib/mysql/;ls -lasd test* | wc -l
549
[root@overcloud-controller-1 mysql]# cd /var/lib/mysql/;ls -lasd test* | wc -l
549
[root@overcloud-controller-2 mysql]# cd /var/lib/mysql/;ls -lasd test* | wc -l
549
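
To confirm the updated package actually landed on every controller before re-running the test, something like this could be used (a sketch only; hostnames as above, assuming passwordless ssh):

# every controller should report resource-agents-3.9.5-40.el7_1.6
for h in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
    echo -n "$h: "
    ssh "$h" rpm -q resource-agents
done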

Comment 13 Ofer Blaut 2015-07-27 08:09:59 UTC
Verified by Asaf on RHEL 7.1.

Comment 15 errata-xmlrpc 2015-08-05 18:29:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1557.html

