Bug 1240394
Summary: Galera cluster goes out of sync after correct overcloud installation

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Reporter | Qasim Sarfraz <qasims> |
| Component | rhosp-director |
| Assignee | Mike Orazi <morazi> |
| Status | CLOSED INSUFFICIENT_DATA |
| QA Contact | Shai Revivo <srevivo> |
| Severity | urgent |
| Priority | high |
| Version | 7.0 (Kilo) |
| CC | cdevine, dmacpher, fdinitto, hbrock, mbayer, mburns, morazi, plancast, rhel-osp-director-maint, shivrao, sopatwar, sthillma |
| Target Milestone | --- |
| Target Release | 10.0 (Newton) |
| Hardware | Unspecified |
| OS | Linux |
| Whiteboard | n1kv |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2016-08-30 12:22:37 UTC |
| Bug Depends On | 1170376 |

Doc Text:

> In some situations, the Galera cluster goes out of synchronization after a successful Overcloud installation. This is due to an issue with Pacemaker; a future version of Pacemaker will include a fix. The fixed package is targeted for release in the next z-stream version of the director.
Description
Qasim Sarfraz
2015-07-06 19:19:02 UTC
Created attachment 1048922 [details]
Neutron-Server logs for controller node
The logs have the trace after the service was restarted.

If galera is going out of sync, this would need to be fixed for GA. Mike, can you confirm this is a widespread issue?

The galera node shown in the log was stopped ungracefully, such as via a kill -9, and cannot start again until run with the --tc-heuristic-recover flag:

[Note] Found 1 prepared transaction(s) in InnoDB
150705 7:46:12 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
150705 7:46:12 [ERROR] Aborting

However, the galera cluster has other nodes which could be running OK; it is not clear if this is during initial bootstrap or what. "clustercheck" only checks one node at a time.

We identified a behavior in the Pacemaker resource agent where it could ungracefully stop a galera node; however, my understanding is that this was fixed. dvossel has the details on this.

This is the first report we've seen of this behavior, but Mike B has already given us some info, and dvossel may have some additional insight as well.

(In reply to Michael Bayer from comment #5)
> The galera node shown in the log was stopped ungracefully such as via a kill -9 and cannot start again, until run with the --tc-heuristic-recover flag:
>
> [Note] Found 1 prepared transaction(s) in InnoDB
> 150705 7:46:12 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
> 150705 7:46:12 [ERROR] Aborting
>
> We identified a behavior in the Pacemaker resource agent where it could ungracefully stop a galera node, however my understanding is that this was fixed. dvossel has the details on this.

Yes, I'm fairly certain we fixed this as a result of the patches associated with this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1170376

Unfortunately, those changes are only scheduled for RHEL 7.2 right now. We'll need to zstream the galera resource-agent fix to get it into 7.1.

Moving to A1 since this is dependent on a pacemaker fix.

Created attachment 1066667 [details]
CRM_Report
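The manual recovery path described in the comments above can be sketched roughly as follows. This is a hedged sketch, not the director's documented procedure: the service name and the COMMIT/ROLLBACK choice are illustrative, and on a Pacemaker-managed node the resource agent should normally handle recovery instead of manual intervention.

```shell
# Sketch: recover a single crashed galera node that refuses to start
# because of a pending prepared (XA) transaction, as in the log above.
# Assumes the rest of the cluster is still healthy; run on the broken node.

# 1. Ensure mysqld is stopped (service name may differ per deployment).
systemctl stop mariadb 2>/dev/null || true

# 2. Resolve the pending transaction. Whether to COMMIT or ROLLBACK must be
#    decided from what the rest of the cluster applied; ROLLBACK is shown
#    purely as an example. MariaDB performs the recovery and then exits.
sudo -u mysql mysqld --tc-heuristic-recover=ROLLBACK

# 3. Start the node again so it rejoins and syncs via IST/SST.
systemctl start mariadb
```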
Seeing this issue consistently on one of the setups. On controller-0 and controller-2, I am able to drop into the mysql prompt, and I see that the DB tables are created successfully for neutron, for example. But when I try to use the DB on controller-1, I get the error below:

MariaDB [(none)]> use nova;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
Also, pcs status shows this:
Failed actions:
galera_promote_0 on overcloud-controller-1 'unknown error' (1): call=104, status=complete, exit-reason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.', last-rc-change='Mon Aug 24 16:22:01 2015', queued=0ms, exec=94ms
rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=89, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 16:21:15 2015', queued=0ms, exec=9651ms
neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=267, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 16:55:08 2015', queued=0ms, exec=0ms
galera_monitor_10000 on overcloud-controller-0 'ok' (0): call=94, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:32 2015', queued=10434ms, exec=116ms
rabbitmq_monitor_10000 on overcloud-controller-0 'not running' (7): call=95, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:26 2015', queued=4351ms, exec=250ms
neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=285, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:54:14 2015', queued=0ms, exec=0ms
galera_monitor_10000 on overcloud-controller-2 'ok' (0): call=83, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:26 2015', queued=14590ms, exec=163ms
rabbitmq_monitor_10000 on overcloud-controller-2 'not running' (7): call=84, status=complete, exit-reason='none', last-rc-change='Mon Aug 24 17:49:14 2015', queued=1962ms, exec=258ms
On another setup, we are hitting this intermittently. I am attaching the crm_report.
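Since, as noted in the comments above, "clustercheck" only checks one node at a time, a quick way to compare all controllers is to query each node's wsrep status directly. A sketch only: the hostnames are taken from the pcs output above, and the credentials/connection options are placeholders for whatever the deployment actually uses.

```shell
# Sketch: query the key wsrep status variables on every controller.
# A healthy node reports wsrep_ready = ON and
# wsrep_local_state_comment = Synced; a node in the state shown for
# controller-1 above would not.
for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
  echo "== $node =="
  mysql -h "$node" -u root -e \
    "SHOW STATUS WHERE Variable_name IN
       ('wsrep_ready','wsrep_local_state_comment','wsrep_cluster_size');"
done
```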
What happens if you do a "pcs resource cleanup galera"? The fact that the node is running as a MySQL node means it isn't in an unstartable state; it just isn't synced.

Need to test the reboot case as described in bug #1160962. Scenario: reboot a controller (HA mode) and check whether the Galera cluster is in sync post-reboot.

Based on comment #12, we have been waiting for information for almost a year. We haven't been able to reproduce the issue, and many fixes have been made to resource-agents and galera to address similar issues. Please retest with current releases and reopen if the problem persists. |
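The cleanup-and-recheck suggested in the comment above can be sketched as below. Resource and script names are assumptions based on a standard Pacemaker/clustercheck overcloud deployment and may differ in a given environment.

```shell
# Sketch: clear the failed-action history so Pacemaker retries the galera
# promotion, then verify the cluster state.
pcs resource cleanup galera

# Give the cluster time to settle, then recheck.
sleep 60
pcs status

# clustercheck returns an HTTP 200-style response when the local node is
# synced; run it on each controller (it only checks the local node).
clustercheck
```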