Bug 1336468

Summary:	Controller replacement procedure fails at step 9 for enabling Galera on the new node
Product:	Red Hat OpenStack	Reporter:	Marius Cornea <mcornea>
Component:	documentation	Assignee:	RHOS Documentation Team <rhos-docs>
Status:	CLOSED DUPLICATE	QA Contact:	RHOS Documentation Team <rhos-docs>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	8.0 (Liberty)	CC:	dciabrin, srevivo
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-05-16 15:15:19 UTC	Type:	Bug

Description Marius Cornea 2016-05-16 14:40:39 UTC

Description of problem:
The controller replacement procedure fails at step 9, when enabling Galera on the new node. I haven't been able to get all 3 nodes running as Galera masters.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.14-9.el7ost.noarch

How reproducible:
a couple of times now

Steps to Reproduce:
Follow the steps documented at
https://access.stage.redhat.com/documentation/en/red-hat-openstack-platform/8/director-installation-and-usage/94-replacing-controller-nodes

Actual results:
After step 9 (Enable Galera on the new node), I get the following results:

 Master/Slave Set: galera-master [galera]
     galera	(ocf::heartbeat:galera):	FAILED Master overcloud-controller-0 (unmanaged)
     galera	(ocf::heartbeat:galera):	FAILED Master overcloud-controller-2 (unmanaged)
     Stopped: [ overcloud-controller-3 ]
 Clone Set: mongod-clone [mongod]
--
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=371, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Mon May 16 14:30:35 2016', queued=0ms, exec=130ms
* galera_promote_0 on overcloud-controller-2 'unknown error' (1): call=367, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Mon May 16 14:30:40 2016', queued=0ms, exec=130ms
* galera_monitor_20000 on overcloud-controller-3 'not running' (7): call=891, status=complete, exitreason='none',
    last-rc-change='Mon May 16 14:39:36 2016', queued=57ms, exec=59ms

Expected results:
Galera starts as master on all 3 controllers.

Comment 3 Marius Cornea 2016-05-16 15:15:19 UTC


*** This bug has been marked as a duplicate of bug 1326507 ***

Comment 4 Damien Ciabrini 2016-05-17 10:47:20 UTC

From the logs, it appears that every time the resource agent on controller #3 tries to store data in the CIB (via crm_resource or crm_master), nothing is stored or updated.

Log entry found on the DC node (controller #1) that could indicate the issue:
May 16 14:27:18 overcloud-controller-3.localdomain attrd[6903]:     crit: Node 'overcloud-controller-1' and 'overcloud-controller-3' share the same cluster nodeid 2: assuming 'overcloud-controller-1' is correct
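
A quick way to look for this kind of nodeid clash is to scan a node list for IDs claimed by more than one name. The sketch below is only an illustration: it assumes input lines of the form "<nodeid> <name> [state]", and the sample data is hypothetical, built to mirror the attrd message above rather than taken from this cluster.

```shell
#!/usr/bin/env bash
# Report cluster node IDs that are claimed by more than one node name.
# Input (stdin): one node per line, "<nodeid> <name> [state]" (assumed layout).
find_duplicate_nodeids() {
    awk '
        NF >= 2 {
            # Count each (id, name) pair only once.
            if (!(($1, $2) in seen)) {
                seen[$1, $2] = 1
                if ($1 in names)
                    names[$1] = names[$1] " " $2
                else
                    names[$1] = $2
                count[$1]++
            }
        }
        END {
            for (id in count)
                if (count[id] > 1)
                    print id ": " names[id]
        }
    '
}

# Hypothetical sample data mirroring the attrd message above.
sample_nodes='1 overcloud-controller-0 member
2 overcloud-controller-1 lost
2 overcloud-controller-3 member
3 overcloud-controller-2 member'

printf '%s\n' "$sample_nodes" | find_duplicate_nodeids
```

On a live cluster the input could come from a command such as crm_node -l, assuming its output matches the layout above; that has not been checked against this environment.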


This is what the resource agent normally does at startup:
. 1) The RA detects the last seqno on controller#3, stores it in the CIB, and checks whether the other nodes have already stored that info.
. 2) As soon as all nodes have that info, or other nodes are already master, the RA sets the resource to "Master" on controller#3 so that the resource is promoted and galera starts on that node.

Both of these actions seem to fail, as nothing is stored in the CIB for controller#3.
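
The two startup steps can be modelled roughly as the decision below. This is a simplified sketch of the behaviour described in this comment, not the resource agent's actual code; the function name and its three arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Simplified model of the startup decision described above (not the RA's code).
# Args: $1 = total nodes in the cluster
#       $2 = nodes whose last seqno is stored in the CIB
#       $3 = nodes that are already Master
startup_decision() {
    local total=$1 reported=$2 masters=$3
    if [ "$masters" -gt 0 ] || [ "$reported" -ge "$total" ]; then
        # Step 2: the node may be marked Master so it gets promoted.
        echo "set-master-score"
    else
        # Step 1 is not complete for every node yet: keep waiting.
        echo "wait"
    fi
}

startup_decision 3 2 0   # one seqno missing from the CIB, no Master up yet
startup_decision 3 2 2   # two Masters already up
```

In this model, a node whose seqno never reaches the CIB keeps the cluster in the "wait" branch until some other node becomes Master.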

* failing action 1)

May 16 14:30:31 overcloud-controller-3.localdomain galera(galera)[10679]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
May 16 14:30:31 overcloud-controller-3.localdomain galera(galera)[10691]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
May 16 14:30:35 overcloud-controller-3.localdomain galera(galera)[11577]: INFO: Last commit version found:  -1
May 16 14:30:35 overcloud-controller-3.localdomain galera(galera)[11605]: INFO: Waiting on node <overcloud-controller-3> to report database status before Master instances can start.

The last log line should not appear, because the RA should have set the key {"last-committed": -1} in the CIB just before, which apparently did not happen.

   # crm_mon -1A:
   [...]
   Node Attributes: 
   * Node overcloud-controller-0: 
       + master-galera                     : 100 
       + master-redis                      : 1 
       + rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-0
       + rmq-node-attr-rabbitmq            : rabbit@overcloud-controller-0
   * Node overcloud-controller-2: 
       + master-redis                      : 1 
       + rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-2
       + rmq-node-attr-rabbitmq            : rabbit@overcloud-controller-2
   * Node overcloud-controller-3: 
       + rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-3 
   [...]

* failing action 2)

Cleaning up resource to force restart:

   # pcs resource cleanup galera

In the journal:
[root@overcloud-controller-3 heat-admin]# journalctl --since today | grep galera 
May 17 10:36:44 overcloud-controller-3.localdomain galera(galera)[29592]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
May 17 10:36:44 overcloud-controller-3.localdomain galera(galera)[29604]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover
May 17 10:36:48 overcloud-controller-3.localdomain galera(galera)[30482]: INFO: Last commit version found:  -1 
May 17 10:36:48 overcloud-controller-3.localdomain galera(galera)[30491]: INFO: Master instances are already up, setting master score so this instance will join galera cluster. 

But the key {master-galera:100} is not stored in the CIB either, so pacemaker will never schedule a 'promote' and galera will never start on controller#3.
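
To spot nodes that lack the attribute, the "Node Attributes" section of the crm_mon -1A output can be filtered as sketched below. The function name is hypothetical, and the parser assumes the layout shown in the crm_mon excerpt earlier in this comment; the sample data is an abridged copy of that excerpt.

```shell
#!/usr/bin/env bash
# List nodes whose "Node Attributes" block lacks a given attribute.
# Input (stdin): text in the "crm_mon -1A" layout quoted earlier (assumed).
# $1: attribute name (default: master-galera)
nodes_missing_attr() {
    awk -v attr="${1:-master-galera}" '
        function report() { if (node != "" && !found) print node }
        /^[ \t]*[*] Node / {
            report()                      # close the previous node block
            node = $3; sub(/:$/, "", node)
            found = 0
            next
        }
        /^[ \t]*[+] / { if ($2 == attr) found = 1 }
        END { report() }
    '
}

# Abridged copy of the crm_mon excerpt from this comment.
sample_attrs='Node Attributes:
* Node overcloud-controller-0:
    + master-galera                     : 100
    + master-redis                      : 1
* Node overcloud-controller-2:
    + master-redis                      : 1
* Node overcloud-controller-3:
    + rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-3'

printf '%s\n' "$sample_attrs" | nodes_missing_attr master-galera
```

On a live node the same filter could be fed directly, for example crm_mon -1A | nodes_missing_attr master-galera. With the excerpt above it also lists overcloud-controller-2, which has no master-galera entry in that output either.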

Note: the RA logs various messages that are expected at startup and are harmless:
May 17 10:36:48 overcloud-controller-3.localdomain lrmd[3753]:   notice: galera_start_0:29501:stderr [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]
May 17 10:37:48 overcloud-controller-3.localdomain galera(galera)[31962]: ERROR: MySQL is not running