Description of problem:
The controller replacement procedure fails at step 9, "Enable Galera on the new node". I have not been able to get all 3 nodes up as Galera masters.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.14-9.el7ost.noarch

How reproducible:
A couple of times now.

Steps to Reproduce:
Follow the steps documented at https://access.stage.redhat.com/documentation/en/red-hat-openstack-platform/8/director-installation-and-usage/94-replacing-controller-nodes

Actual results:
After step 9 ("Enable Galera on the new node") I get the following results:

 Master/Slave Set: galera-master [galera]
     galera (ocf::heartbeat:galera):   FAILED Master overcloud-controller-0 (unmanaged)
     galera (ocf::heartbeat:galera):   FAILED Master overcloud-controller-2 (unmanaged)
     Stopped: [ overcloud-controller-3 ]
 Clone Set: mongod-clone [mongod]
--
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=371, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Mon May 16 14:30:35 2016', queued=0ms, exec=130ms
* galera_promote_0 on overcloud-controller-2 'unknown error' (1): call=367, status=complete, exitreason='Failure, Attempted to promote Master instance of galera before bootstrap node has been detected.',
    last-rc-change='Mon May 16 14:30:40 2016', queued=0ms, exec=130ms
* galera_monitor_20000 on overcloud-controller-3 'not running' (7): call=891, status=complete, exitreason='none',
    last-rc-change='Mon May 16 14:39:36 2016', queued=57ms, exec=59ms

Expected results:
Galera starts as master on all 3 controllers.
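For reference, a quick way to confirm whether all three controllers came up as Galera masters after step 9 (a minimal check, run on any controller, assuming the default resource name galera-master shown in the output above):

# pcs status | grep -A4 'galera-master'     # all three controllers should show "Master"
# crm_mon -1A | grep 'master-galera'        # each controller should report master-galera : 100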
*** This bug has been marked as a duplicate of bug 1326507 ***
Looking at the logs, it looks like every time the resource agent on controller #3 tries to store data in the CIB (via crm_resource or crm_master), nothing is stored or updated.

Log found on the DC node (controller #1) that could indicate the issue:

May 16 14:27:18 overcloud-controller-3.localdomain attrd[6903]: crit: Node 'overcloud-controller-1' and 'overcloud-controller-3' share the same cluster nodeid 2: assuming 'overcloud-controller-1' is correct

This is what the resource agent normally does at startup:

1) The RA detects the last seqno on controller #3, stores it in the CIB, and checks whether the other nodes have that info stored already.
2) As soon as all nodes have that info, or other nodes are already master, the RA sets the resource to "Master" on controller #3 to promote the resource and start Galera on that node.

It seems both of these actions fail, as nothing is stored in the CIB for controller #3.

* Failing action 1)

May 16 14:30:31 overcloud-controller-3.localdomain galera(galera)[10679]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
May 16 14:30:31 overcloud-controller-3.localdomain galera(galera)[10691]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
May 16 14:30:35 overcloud-controller-3.localdomain galera(galera)[11577]: INFO: Last commit version found: -1
May 16 14:30:35 overcloud-controller-3.localdomain galera(galera)[11605]: INFO: Waiting on node <overcloud-controller-3> to report database status before Master instances can start.

The last log line should not appear, because the RA should have just set the key {"last-committed": -1} in the CIB, which is apparently not the case.

# crm_mon -1A
[...]
Node Attributes:
* Node overcloud-controller-0:
    + master-galera                     : 100
    + master-redis                      : 1
    + rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-0
    + rmq-node-attr-rabbitmq            : rabbit@overcloud-controller-0
* Node overcloud-controller-2:
    + master-redis                      : 1
    + rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-2
    + rmq-node-attr-rabbitmq            : rabbit@overcloud-controller-2
* Node overcloud-controller-3:
    + rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-3
[...]

* Failing action 2)

Cleaning up the resource to force a restart:

# pcs resource cleanup galera

In the journal:

[root@overcloud-controller-3 heat-admin]# journalctl --since today | grep galera
May 17 10:36:44 overcloud-controller-3.localdomain galera(galera)[29592]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
May 17 10:36:44 overcloud-controller-3.localdomain galera(galera)[29604]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
May 17 10:36:48 overcloud-controller-3.localdomain galera(galera)[30482]: INFO: Last commit version found: -1
May 17 10:36:48 overcloud-controller-3.localdomain galera(galera)[30491]: INFO: Master instances are already up, setting master score so this instance will join galera cluster.

But the key {master-galera: 100} is not stored in the CIB either, so Pacemaker will never schedule a 'promote' and Galera will never start on controller #3.

Note: the RA logs a few things at startup which are expected and not harmful:

May 17 10:36:48 overcloud-controller-3.localdomain lrmd[3753]: notice: galera_start_0:29501:stderr [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]
May 17 10:37:48 overcloud-controller-3.localdomain galera(galera)[31962]: ERROR: MySQL is not running
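Given the "share the same cluster nodeid 2" message above, the following can help confirm whether a stale membership entry from the replaced controller is what blocks attribute updates from controller #3. This is only a diagnostic sketch: the attribute name master-galera is taken from the crm_mon output above, and the stale node name (overcloud-controller-1) is assumed from the attrd message.

# crm_node -l                                  # node ids/names as known to pacemaker
# corosync-cmapctl | grep -i members           # node ids as known to corosync
# crm_attribute -N overcloud-controller-3 -l reboot -n master-galera --query
# crm_node -R overcloud-controller-1 --force   # if the old node is still listed, purge its stale entry

If the duplicate nodeid really prevents attrd from committing the attributes, the crm_attribute query should keep returning no value even right after the RA logs "setting master score".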