Bug 1336468
| Summary: | Controller replacement procedure fails at step 9 for enabling Galera on the new node | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
| Component: | documentation | Assignee: | RHOS Documentation Team <rhos-docs> |
| Status: | CLOSED DUPLICATE | QA Contact: | RHOS Documentation Team <rhos-docs> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 (Liberty) | CC: | dciabrin, srevivo |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-16 15:15:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marius Cornea
2016-05-16 14:40:39 UTC
*** This bug has been marked as a duplicate of bug 1326507 ***

Looking at the logs, it looks like every time the resource agent on controller #3 tries to store data in the CIB (via crm_resource or crm_master), nothing is stored or updated.
A log entry found on the DC node (controller #1) that could indicate the issue:
May 16 14:27:18 overcloud-controller-3.localdomain attrd[6903]: crit: Node 'overcloud-controller-1' and 'overcloud-controller-3' share the same cluster nodeid 2: assuming 'overcloud-controller-1' is correct
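To cross-check this from any controller (not part of the original report, just a suggested verification; output format varies between corosync/pacemaker versions), the node IDs known to the cluster can be listed:
# crm_node -l
# corosync-cmapctl | grep nodelist
If both overcloud-controller-1 and overcloud-controller-3 show up with nodeid 2, that matches the attrd complaint above and would explain why attribute updates from controller #3 are being dropped.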
This is what the resource agent normally does at startup:
1) The RA detects the last seqno on controller #3, stores it in the CIB, and checks whether the other nodes have that info stored already.
2) As soon as all nodes have that info, or other nodes are already master, the RA sets a "Master" score on controller #3 so that pacemaker promotes the resource and starts galera on that node.
It seems both of these actions fail, as nothing is stored in the CIB for controller #3.
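For reference (these commands are not in the original report; the attribute names are inferred from the crm_mon output below and from the agent's naming convention, so treat this as an approximation), step 1) amounts to writing a transient node attribute with the detected seqno, and step 2) amounts to giving the node a promotion score, roughly:
# crm_attribute -N overcloud-controller-3 -l reboot --name galera-last-committed -v <seqno>
# crm_attribute -N overcloud-controller-3 -l reboot --name master-galera -v 100
Inside the RA the promotion score is normally set via crm_master rather than by calling crm_attribute directly, but either way it ends up as the master-galera attribute visible in crm_mon -A.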
* failing action 1)
May 16 14:30:31 overcloud-controller-3.localdomain galera(galera)[10679]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
May 16 14:30:31 overcloud-controller-3.localdomain galera(galera)[10691]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
May 16 14:30:35 overcloud-controller-3.localdomain galera(galera)[11577]: INFO: Last commit version found: -1
May 16 14:30:35 overcloud-controller-3.localdomain galera(galera)[11605]: INFO: Waiting on node <overcloud-controller-3> to report database status before Master instances can start.
The last log line should not appear, because the RA should have set the key {"last-committed": -1} in the CIB just before, which is apparently not the case.
# crm_mon -1A:
[...]
Node Attributes:
* Node overcloud-controller-0:
+ master-galera : 100
+ master-redis : 1
+ rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-0
+ rmq-node-attr-rabbitmq : rabbit@overcloud-controller-0
* Node overcloud-controller-2:
+ master-redis : 1
+ rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-2
+ rmq-node-attr-rabbitmq : rabbit@overcloud-controller-2
* Node overcloud-controller-3:
+ rmq-node-attr-last-known-rabbitmq : rabbit@overcloud-controller-3
[...]
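Not from the original report, but a quick way to confirm from the command line that nothing was written for controller #3 (both commands should come back empty there while returning a value on controller #0):
# attrd_updater -Q -n master-galera -N overcloud-controller-3
# crm_attribute -N overcloud-controller-3 -l reboot --name master-galera --query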
* failing action 2)
Cleaning up the resource to force a restart:
# pcs resource cleanup galera
In the journal:
[root@overcloud-controller-3 heat-admin]# journalctl --since today | grep galera
May 17 10:36:44 overcloud-controller-3.localdomain galera(galera)[29592]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
May 17 10:36:44 overcloud-controller-3.localdomain galera(galera)[29604]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
May 17 10:36:48 overcloud-controller-3.localdomain galera(galera)[30482]: INFO: Last commit version found: -1
May 17 10:36:48 overcloud-controller-3.localdomain galera(galera)[30491]: INFO: Master instances are already up, setting master score so this instance will join galera cluster.
But the key {master-galera: 100} is not stored in the CIB either, so pacemaker will never schedule a 'promote' and galera will never start on controller #3.
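One way to see the scores pacemaker is actually working with (again not from the original report, and the exact output format depends on the pacemaker version) is to dump the allocation/master scores against the live cluster state:
# crm_simulate -sL | grep -i galera
A missing or zero master score for overcloud-controller-3 there would be consistent with the analysis above.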
Note: the RA logs various things at startup which are expected and not harmful:
May 17 10:36:48 overcloud-controller-3.localdomain lrmd[3753]: notice: galera_start_0:29501:stderr [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]
May 17 10:37:48 overcloud-controller-3.localdomain galera(galera)[31962]: ERROR: MySQL is not running