Bug 1240394 - Galera cluster goes out of sync after correct overcloud installation
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Mike Orazi
QA Contact: Shai Revivo
URL:
Whiteboard: n1kv
Depends On: 1170376
Blocks:
 
Reported: 2015-07-06 19:19 UTC by Qasim Sarfraz
Modified: 2016-08-30 12:22 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In some situations, the Galera cluster loses synchronization after a successful Overcloud installation. This is due to an issue in Pacemaker; a future version of Pacemaker will include a fix. The fixed package is targeted for the next z-stream release of the director.
Clone Of:
Environment:
Last Closed: 2016-08-30 12:22:37 UTC
Target Upstream Version:
Embargoed:


Attachments
MariaDB logs for the controller node (15.72 KB, text/plain)
2015-07-06 19:19 UTC, Qasim Sarfraz
Neutron-Server logs for controller node (13.27 MB, text/plain)
2015-07-06 19:21 UTC, Qasim Sarfraz
CRM_Report (3.46 MB, application/x-bzip)
2015-08-25 00:05 UTC, Shiva Prasad Rao

Description Qasim Sarfraz 2015-07-06 19:19:02 UTC
Created attachment 1048916 [details]
MariaDB logs for the controller node

Description of problem:
The Galera cluster goes out of sync after an otherwise successful overcloud installation. The database is not accessible; as a result, OpenStack services are unable to start.

$ mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)

I tried using the workaround ( https://bugzilla.redhat.com/show_bug.cgi?id=1235458 ). The MySQL console is accessible, but Galera is still not synced:

$ /usr/bin/clustercheck 
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36

Galera cluster node is not synced.

Now, if I try starting any OpenStack service, it fails to start.
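For reference, clustercheck only reports the sync state over HTTP; the same state can be read directly from the server via the wsrep_local_state_comment status variable. A minimal sketch of interpreting it (the helper name is ours; only the SHOW STATUS query is real MySQL syntax):

```shell
# Only the value "Synced" means the node is safe for application traffic;
# "Joining", "Joined", "Donor/Desynced", etc. mean it is (re)joining.
# On a live node the state comes from:
#   mysql -Nse "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}'
galera_state_ok() {
  [ "$1" = "Synced" ]
}

galera_state_ok "Synced" && echo "node synced"
galera_state_ok "Donor/Desynced" || echo "node not synced"
```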

Version-Release number of selected component (if applicable):
Core: 7.0-RC-2
Director: director-Beta-2

How reproducible:
Unknown; the issue occurs intermittently.


Additional info:
All relevant logs are attached for detailed analysis of the issue.

Comment 3 Qasim Sarfraz 2015-07-06 19:21:14 UTC
Created attachment 1048922 [details]
Neutron-Server logs for controller node

The logs contain the traceback from after the service was restarted.

Comment 4 chris alfonso 2015-07-07 17:18:19 UTC
If Galera is going out of sync, this would need to be fixed for GA. Mike, can you confirm whether this is a widespread issue?

Comment 5 Michael Bayer 2015-07-07 21:25:05 UTC
The Galera node shown in the log was stopped ungracefully (e.g., via kill -9) and cannot start again until run with the --tc-heuristic-recover flag:

[Note] Found 1 prepared transaction(s) in InnoDB
150705  7:46:12 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
150705  7:46:12 [ERROR] Aborting

However, the Galera cluster has other nodes which could be running fine; it is not clear whether this happened during the initial bootstrap. Note that "clustercheck" only checks one node at a time.

We identified a behavior in the Pacemaker resource agent where it could ungracefully stop a Galera node; however, my understanding is that this was fixed. dvossel has the details on this.
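The crash signature quoted above can be detected mechanically from the MariaDB log. A hedged sketch (the grep pattern matches the error text quoted in this comment; the recovery steps mentioned in the comments are a possible sequence, not a verified procedure):

```shell
# Return success if a MariaDB log shows unrecovered prepared transactions,
# i.e. the node was killed ungracefully and needs --tc-heuristic-recover.
needs_heuristic_recovery() {
  grep -q 'Found .* prepared transactions!' "$1"
}

# Demo against a mock log line taken from this report:
log=$(mktemp)
echo "150705  7:46:12 [ERROR] Found 1 prepared transactions!" > "$log"
if needs_heuristic_recovery "$log"; then
  echo "recovery needed: start mysqld with --tc-heuristic-recover"
  # Possible (unverified) sequence: stop the resource, run mysqld with
  # --tc-heuristic-recover to commit or roll back, then restart.
fi
rm -f "$log"
```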

Comment 6 Mike Orazi 2015-07-07 21:32:31 UTC
This is the first report we've seen of this behavior, but Mike B has already given us some info, and dvossel may have some additional insight as well.

Comment 7 David Vossel 2015-07-07 21:35:11 UTC
(In reply to Michael Bayer from comment #5)
> The galera node shown in the log was stopped ungracefully such as via a kill
> -9 and cannot start again, until run with the --tc-heuristic-recover flag:
> 
> [Note] Found 1 prepared transaction(s) in InnoDB
> 150705  7:46:12 [ERROR] Found 1 prepared transactions! It means that mysqld
> was not shut down properly last time and critical recovery information (last
> binlog or tc.log file) was manually deleted after a crash. You have to start
> mysqld with --tc-heuristic-recover switch to commit or rollback pending
> transactions.
> 150705  7:46:12 [ERROR] Aborting
> 
> however, the galera cluster has other nodes which could be running OK, not
> clear if this is during initial bootstrap or what.  "clustercheck" only
> checks one node at a time.
> 
> We identified a behavior in the Pacemaker resource agent where it could
> ungracefully stop a galera node, however my understanding is that this was
> fixed.  dvossel has the details on this.

Yes, I'm fairly certain we fixed this as a result of the patches associated with this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=1170376

Unfortunately, those changes are only scheduled for RHEL 7.2 right now. We'll need to z-stream the galera resource-agent fix to get it into 7.1.

Comment 8 Mike Burns 2015-07-08 19:53:44 UTC
Moving to A1 since this is dependent on a Pacemaker fix.

Comment 10 Shiva Prasad Rao 2015-08-25 00:05:57 UTC
Created attachment 1066667 [details]
CRM_Report

Seeing this issue consistently on one of the setups. On controller-0 and controller-2 I am able to drop into the mysql prompt, and I can see that the DB tables were created successfully (for neutron, for example), but when I try to use the DB on controller-1 I get the error below:
MariaDB [(none)]> use nova;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

pcs status also shows this:
Failed actions:
    galera_promote_0 on overcloud-controller-1 'unknown error' (1): call=104, status=complete, exit-reason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.', last-rc-change='Mon Aug 24 16:22:01 2015', queued=0ms, exec=94ms
    rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=89, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 16:21:15 2015', queued=0ms, exec=9651ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=267, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 16:55:08 2015', queued=0ms, exec=0ms
    galera_monitor_10000 on overcloud-controller-0 'ok' (0): call=94, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:32 2015', queued=10434ms, exec=116ms
    rabbitmq_monitor_10000 on overcloud-controller-0 'not running' (7): call=95, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:26 2015', queued=4351ms, exec=250ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=285, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:54:14 2015', queued=0ms, exec=0ms
    galera_monitor_10000 on overcloud-controller-2 'ok' (0): call=83, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:26 2015', queued=14590ms, exec=163ms
    rabbitmq_monitor_10000 on overcloud-controller-2 'not running' (7): call=84, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:14 2015', queued=1962ms, exec=258ms


On another setup, we are hitting this intermittently. I am attaching the crm_report.

Comment 11 Michael Bayer 2015-08-25 00:58:29 UTC
What happens if you do a "pcs resource cleanup galera"? The fact that the node is running as a MySQL node means it isn't in an unstartable state; it just isn't synced.
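For what it's worth, the failed actions quoted in comment 10 can be counted from `pcs status` output before and after a cleanup; a cleanup should reset the failcount and clear them. A sketch (the grep pattern is tailored to the output quoted above):

```shell
# Count failed galera actions in `pcs status` text.
# Real use: pcs status | failed_galera_actions
failed_galera_actions() {
  grep -c "galera.*'unknown error'"
}

# Demo against a line quoted from comment 10:
printf '%s\n' "galera_promote_0 on overcloud-controller-1 'unknown error' (1)" \
  | failed_galera_actions
```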

Comment 12 Sourabh Patwardhan 2015-09-10 21:54:29 UTC
Need to test the reboot case as described in bug #1160962.
Scenario: reboot a controller (HA mode).
Check whether the Galera cluster is in sync post-reboot.
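One way to script that post-reboot check is to poll clustercheck until it reports the node synced. A minimal sketch (the check command is passed as an argument so the loop itself is testable; real use would pass /usr/bin/clustercheck):

```shell
# Poll a check command until it reports HTTP 200, up to N attempts.
# clustercheck answers "HTTP/1.1 200 OK" when the node is synced and
# "HTTP/1.1 503 Service Unavailable" (as seen above) when it is not.
wait_for_sync() {
  attempts="$1"; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" | grep -q '200 OK'; then
      echo "synced"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "not synced after $attempts attempts"
  return 1
}

# Real post-reboot use would be: wait_for_sync 60 /usr/bin/clustercheck
wait_for_sync 1 echo "HTTP/1.1 200 OK"
```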

Comment 16 Fabio Massimo Di Nitto 2016-08-30 12:22:37 UTC
Based on comment #12, we have been waiting for information for almost a year.

We haven't been able to reproduce the issue, and many fixes have been made to resource-agents and Galera to address similar issues.

Please retest with current releases and reopen if the problem persists.

