Bug 1263981
Summary: | pacemaker does not recover galera resource on Master controller & leaves it 'unmanaged' | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Jaison Raju <jraju>
Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof>
Status: | CLOSED NOTABUG | QA Contact: | cluster-qe <cluster-qe>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 7.1 | CC: | cluster-maint, jraju
Target Milestone: | rc | Keywords: | Unconfirmed
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-09-24 00:20:25 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (Jaison Raju, 2015-09-17 09:06:00 UTC)
> Resource clear / cleanup / restart did not fix this issue.
These commands only do something useful if the underlying problem has already been fixed or was transient in nature.
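For reference, this is the kind of cleanup being referred to (a sketch using the resource name from this report, with the RHEL 7 era pcs/crm_resource syntax):

```
# Clear the recorded failures for galera so Pacemaker re-probes it;
# this only helps once the underlying fault is actually gone.
pcs resource cleanup galera

# Inspect the failure counts Pacemaker has accumulated for the resource.
pcs resource failcount show galera

# Low-level equivalent via crm_resource.
crm_resource --cleanup --resource galera
```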
It looks like either the network or corosync had a full meltdown; no node was able to see any of its peers:
Sep 09 09:35:48 [16343] controller3.sparklab crmd: notice: crm_update_peer_state: pcmk_quorum_notification: Node pcmk-controller1[1] - state is now member (was lost)
Sep 09 09:35:48 [16343] controller3.sparklab crmd: notice: crm_update_peer_state: pcmk_quorum_notification: Node pcmk-controller2[2] - state is now member (was lost)
Sep 09 09:35:48 [16373] controller1.sparklab crmd: notice: crm_update_peer_state: pcmk_quorum_notification: Node pcmk-controller2[2] - state is now member (was lost)
Sep 09 09:35:48 [16373] controller1.sparklab crmd: notice: crm_update_peer_state: pcmk_quorum_notification: Node pcmk-controller3[3] - state is now member (was lost)
Sep 09 09:35:48 [16392] controller2.sparklab crmd: notice: crm_update_peer_state: pcmk_quorum_notification: Node pcmk-controller1[1] - state is now member (was lost)
Sep 09 09:35:48 [16392] controller2.sparklab crmd: notice: crm_update_peer_state: pcmk_quorum_notification: Node pcmk-controller3[3] - state is now member (was lost)
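A quick way to confirm whether the nodes can actually see each other at the membership layer (a sketch; exact output varies by corosync/pacemaker version):

```
# Quorum state and current membership as corosync sees it.
corosync-quorumtool -s

# One-shot snapshot of Pacemaker's view of node and resource state.
crm_mon -1
```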
That, plus the lack of fencing, explains:
Sep 09 09:35:21 [16391] controller2.sparklab pengine: info: LogActions: Leave galera:1 (Stopped)
Sep 09 09:35:21 [16391] controller2.sparklab pengine: info: LogActions: Leave galera:2 (Stopped)
Sep 09 09:35:21 [16391] controller2.sparklab pengine: notice: LogActions: Stop galera:0 (Master pcmk-controller2)
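Whether fencing is configured can be checked directly (a sketch against the RHEL 7 pcs syntax):

```
# Is fencing enabled cluster-wide?
pcs property show stonith-enabled

# List configured fence devices; an empty list matches this report.
pcs stonith show
```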
No wonder galera got messed up and wouldn't restart:
Sep 09 09:49:40 [16370] controller1.sparklab lrmd: notice: operation_finished: galera_promote_0:21691:stderr [ ocf-exit-reason:MySQL server failed to start (pid=21802) (rc=0), please check your installation ]
Sep 09 09:49:40 [16389] controller2.sparklab lrmd: notice: operation_finished: galera_promote_0:32213:stderr [ ocf-exit-reason:MySQL server failed to start (pid=32331) (rc=0), please check your installation ]
Sep 09 13:23:13 [16340] controller3.sparklab lrmd: notice: operation_finished: galera_monitor_10000:27507:stderr [ ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) ]
Sep 09 13:23:13 [16340] controller3.sparklab lrmd: notice: operation_finished: galera_monitor_10000:27507:stderr [ ocf-exit-reason:local node <pcmk-controller3> is started, but not in primary mode. Unknown state. ]
Sep 09 13:23:13 [16340] controller3.sparklab lrmd: notice: operation_finished: galera_monitor_10000:27507:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'monitor_user' has permissions to view status ]
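The monitor failure above can be reproduced by hand with the same query the galera resource agent runs (monitor_user is the check_user from this report's resource configuration; the interactive password prompt is an assumption):

```
# Ask the local node for its Galera cluster status, as the agent's
# monitor does. On a healthy node in the primary component this
# returns "Primary"; anything else matches the errors above.
mysql -u monitor_user -p -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
```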
Closing since there was no fencing configured (fencing is required for support).
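For completeness, a minimal fence-device example (hypothetical: the device name, IP address, and credentials below are placeholders, not values from this report):

```
# Create one IPMI fence device per node (repeat for each controller).
pcs stonith create fence-controller1 fence_ipmilan \
    pcmk_host_list="pcmk-controller1" ipaddr="192.0.2.11" \
    login="admin" passwd="secret" lanplus="1" \
    op monitor interval=60s

# Re-enable fencing once every node has a working device.
pcs property set stonith-enabled=true
```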