Bug 1240394

Summary:

Galera cluster goes out of sync after correct overcloud installation

Product:

Red Hat OpenStack

Reporter:

Qasim Sarfraz <qasims>

Component:

rhosp-director

Assignee:

Mike Orazi <morazi>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Shai Revivo <srevivo>

Severity:

urgent

Docs Contact:

Priority:

high

Version:

7.0 (Kilo)

CC:

cdevine, dmacpher, fdinitto, hbrock, mbayer, mburns, morazi, plancast, rhel-osp-director-maint, shivrao, sopatwar, sthillma

Target Milestone:

---

Target Release:

10.0 (Newton)

Hardware:

Unspecified

OS:

Linux

Whiteboard:

n1kv

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

In some situations, the Galera cluster goes out of synchronization after successful Overcloud installation. This is due to an issue with Pacemaker. A future version of Pacemaker will include a fix. This package is aimed for release in the next z-stream version of the director.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-08-30 12:22:37 UTC

Type:

Bug

Bug Depends On:

1170376

Bug Blocks:

Attachments:

Description	Flags
MariaDB logs for the controller node	none
Neutron-Server logs for controller node	none
CRM_Report	none

Description Qasim Sarfraz 2015-07-06 19:19:02 UTC

Created attachment 1048916
MariaDB logs for the controller node

Description of problem:
Galera cluster goes out of sync after a correct overcloud installation. The database is not accessible and, as a result, OpenStack services aren't able to start.

$ mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)

I tried using the workaround ( https://bugzilla.redhat.com/show_bug.cgi?id=1235458 ). The MySQL console is accessible, but Galera is still not synced:

$ /usr/bin/clustercheck 
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36

Galera cluster node is not synced.
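
To script this check, it is enough to read the status line of the reply. The sketch below is only an illustration and is not part of clustercheck. The function name clustercheck_state is made up here, and it assumes the reply starts with an HTTP status line like the one above, with 200 meaning synced and 503 meaning not synced.

```shell
# Classify a clustercheck reply read from stdin.
# Assumption: the first line is an HTTP status line such as
# "HTTP/1.1 503 Service Unavailable".
clustercheck_state() {
    # Take the status code (second field) from the first line only.
    code=$(awk 'NR == 1 && $1 ~ /^HTTP\// { print $2 }')
    case "$code" in
        200) echo synced ;;
        503) echo not-synced ;;
        *)   echo unknown ;;
    esac
}
```

On a controller it could be used as: /usr/bin/clustercheck | clustercheck_state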

Now, if I try starting any OpenStack service, it fails to start.

Version-Release number of selected component (if applicable):
Core: 7.0-RC-2
Director: director-Beta-2

How reproducible:
Unknown; the issue occurs intermittently.


Additional info:
I have attached all the relevant logs for detailed analysis of the issue.

Comment 3 Qasim Sarfraz 2015-07-06 19:21:14 UTC

Created attachment 1048922
Neutron-Server logs for controller node

The logs contain the trace from after the service was restarted.

Comment 4 chris alfonso 2015-07-07 17:18:19 UTC

If Galera is going out of sync, this would need to be fixed for GA. Mike, can you confirm whether this is a widespread issue?

Comment 5 Michael Bayer 2015-07-07 21:25:05 UTC

The galera node shown in the log was stopped ungracefully such as via a kill -9 and cannot start again, until run with the --tc-heuristic-recover flag:

[Note] Found 1 prepared transaction(s) in InnoDB
150705  7:46:12 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
150705  7:46:12 [ERROR] Aborting
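
A quick way to spot this condition in a log is to search for the abort message itself. The sketch below assumes the message wording shown above. The function name needs_heuristic_recover and the log path in the usage note are hypothetical.

```shell
# Read a MariaDB error log on stdin and succeed (exit status 0) if it
# contains the "prepared transactions" abort message quoted above.
# Assumption: the message wording matches the excerpt in this comment.
needs_heuristic_recover() {
    grep -Eq '\[ERROR\] Found [0-9]+ prepared transactions!'
}
```

For example: needs_heuristic_recover < /path/to/mariadb.log && echo "node needs --tc-heuristic-recover"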

however, the galera cluster has other nodes which could be running OK, not clear if this is during initial bootstrap or what.  "clustercheck" only checks one node at a time.

We identified a behavior in the Pacemaker resource agent where it could ungracefully stop a galera node, however my understanding is that this was fixed.  dvossel has the details on this.

Comment 6 Mike Orazi 2015-07-07 21:32:31 UTC

This is the first report we've seen on this behavior but Mike B has already given us some info & dvossel may have some additional insight as well.

Comment 7 David Vossel 2015-07-07 21:35:11 UTC

(In reply to Michael Bayer from comment #5)
> The galera node shown in the log was stopped ungracefully such as via a kill
> -9 and cannot start again, until run with the --tc-heuristic-recover flag:
> 
> [Note] Found 1 prepared transaction(s) in InnoDB
> 150705  7:46:12 [ERROR] Found 1 prepared transactions! It means that mysqld
> was not shut down properly last time and critical recovery information (last
> binlog or tc.log file) was manually deleted after a crash. You have to start
> mysqld with --tc-heuristic-recover switch to commit or rollback pending
> transactions.
> 150705  7:46:12 [ERROR] Aborting
> 
> however, the galera cluster has other nodes which could be running OK, not
> clear if this is during initial bootstrap or what.  "clustercheck" only
> checks one node at a time.
> 
> We identified a behavior in the Pacemaker resource agent where it could
> ungracefully stop a galera node, however my understanding is that this was
> fixed.  dvossel has the details on this.

Yes, I'm fairly certain we fixed this as a result of the patches associated with this bug.
https://bugzilla.redhat.com/show_bug.cgi?id=1170376

Unfortunately, those changes are only scheduled for RHEL 7.2 right now. We'll need to z-stream the galera resource-agent fix to get it into 7.1.

Comment 8 Mike Burns 2015-07-08 19:53:44 UTC

Moving to A1 since this depends on a Pacemaker fix.

Comment 10 Shiva Prasad Rao 2015-08-25 00:05:57 UTC

Created attachment 1066667
CRM_Report

Seeing this issue consistently on one of the setups. On controller-0 and controller-2 I am able to drop into the mysql prompt, and I see that the DB tables are created successfully (for neutron, for example). But when I try to use the DB on controller-1, I get the error below:
MariaDB [(none)]> use nova;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
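
When a node returns error 1047, its wsrep status variables usually show which condition is not met. A minimal sketch, assuming the node exposes wsrep_ready, wsrep_cluster_status and wsrep_local_state_comment, and that the mysql client prints tab-separated rows when run with -N -B. The function name wsrep_verdict is hypothetical.

```shell
# Read tab-separated "name<TAB>value" rows on stdin, for example from
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"
# and print a one-line verdict for this node.
# Assumption: a usable node reports ON / Primary / Synced.
wsrep_verdict() {
    awk -F '\t' '
        $1 == "wsrep_ready"               { ready = $2 }
        $1 == "wsrep_cluster_status"      { cluster = $2 }
        $1 == "wsrep_local_state_comment" { state = $2 }
        END {
            if (ready == "ON" && cluster == "Primary" && state == "Synced")
                print "usable"
            else
                print "not-usable ready=" ready " cluster=" cluster " state=" state
        }'
}
```

For example: mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | wsrep_verdict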

Also, pcs status shows this:
Failed actions:
    galera_promote_0 on overcloud-controller-1 'unknown error' (1): call=104, status=complete, exit-reason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.', last-rc-change='Mon Aug 24 16:22:01 2015', queued=0ms, exec=94ms
    galera_promote_0 on overcloud-controller-1 'unknown error' (1): call=104, status=complete, exit-reason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.', last-rc-change='Mon Aug 24 16:22:01 2015', queued=0ms, exec=94ms
    rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=89, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 16:21:15 2015', queued=0ms, exec=9651ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=267, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 16:55:08 2015', queued=0ms, exec=0ms
    galera_monitor_10000 on overcloud-controller-0 'ok' (0): call=94, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:32 2015', queued=10434ms, exec=116ms
    rabbitmq_monitor_10000 on overcloud-controller-0 'not running' (7): call=95, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:26 2015', queued=4351ms, exec=250ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=285, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:54:14 2015', queued=0ms, exec=0ms
    galera_monitor_10000 on overcloud-controller-2 'ok' (0): call=83, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:26 2015', queued=14590ms, exec=163ms
    rabbitmq_monitor_10000 on overcloud-controller-2 'not running' (7): call=84, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:14 2015', queued=1962ms, exec=258ms
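
A list like this is easier to read once duplicates and successful entries are dropped. The sketch below is an illustration only: the function name failed_actions_summary is made up, and it assumes each entry has the form "<action> on <node> '<reason>' (<rc>): ..." as in the output above.

```shell
# Read "Failed actions" lines from pcs status on stdin and print one
# "<node> <action> rc=<code>" line per distinct failure with a non-zero code.
# Assumption: entries look like
#   galera_promote_0 on overcloud-controller-1 'unknown error' (1): call=104, ...
failed_actions_summary() {
    awk '
        $2 == "on" && match($0, /\([0-9]+\):/) {
            # The match is "(<rc>):", so strip the "(" and the "):".
            rc = substr($0, RSTART + 1, RLENGTH - 3)
            key = $3 " " $1 " rc=" rc
            if (rc + 0 != 0 && !(key in seen)) {
                seen[key] = 1
                print key
            }
        }'
}
```

For example: pcs status | failed_actions_summary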


On another setup, we are hitting this intermittently. I am attaching the crm_report.

Comment 11 Michael Bayer 2015-08-25 00:58:29 UTC

What happens if you do a "pcs resource cleanup galera"? The fact that the node is running as a MySQL node means it isn't in an unstartable state; it just isn't synced.

Comment 12 Sourabh Patwardhan 2015-09-10 21:54:29 UTC

Need to test reboot case as described in bug #1160962.
Scenario: Reboot controller (HA mode).
Check if Galera cluster is in sync post-reboot.
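
For the post-reboot check, a small retry loop avoids judging the node before it has had time to rejoin. This is a sketch under assumptions: the helper name wait_until is made up, the retry count and delay are arbitrary, and it relies on the check command exiting non-zero while the node is not synced.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
# Note: uses the plain variables "tries" and "delay" (no "local") to stay POSIX.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        # Stop as soon as the command reports success.
        "$@" && return 0
        tries=$((tries - 1))
        # Sleep only if another attempt follows.
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, after the controller comes back up: wait_until 30 10 /usr/bin/clustercheck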

Comment 16 Fabio Massimo Di Nitto 2016-08-30 12:22:37 UTC

Based on comment #12, we have been waiting for information for almost a year.

We haven't been able to reproduce the issue, and many fixes have been made to the resource-agents and galera to address similar issues.

Please retest with current releases and reopen if the problem persists.