Bug 1263981

Summary: pacemaker does not recover galera resource on Master controller & leaves it 'unmanaged'
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.1
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Target Milestone: rc
Keywords: Unconfirmed
Reporter: Jaison Raju <jraju>
Assignee: Andrew Beekhof <abeekhof>
QA Contact: cluster-qe <cluster-qe>
CC: cluster-maint, jraju
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-09-24 00:20:25 UTC

Description Jaison Raju 2015-09-17 09:06:00 UTC
Description of problem:
After a network outage, the galera resource goes down and cannot be recovered on the Master node, which is shown as unmanaged.

Resource clear / cleanup / restart did not fix this issue.
We were not able to change the resource status back to managed.

Putting the cluster into standby/unstandby and then doing a resource restart & cleanup fixed this issue.
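
Roughly, the sequence that eventually recovered it was the following (command forms as used further below in the details; this is a sketch of what was run, not a verified minimal set of steps):
# pcs cluster standby --all
# pcs cluster unstandby --all
# pcs resource restart galera
# pcs resource cleanup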


Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.4.x86_64
RHOS6

How reproducible:
No

Steps to Reproduce:
1.
2.
3.

Actual results:
Following are some things I noticed while working on this issue:
i. mariadb was running on controller1/3 as slaves, and not on controller2.
'pcs status' :
"galera (ocf:heartbeat:galera):                FAILED Master pcml-controler2 (unmanaged)"
pacemaker.log:
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:   notice: can_be_master:	Forcing unmanaged master galera:2 to remain promoted on pcmk-controller1
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:     info: master_color:	Promoting galera:2 (Master pcmk-controller1)
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:     info: master_color:	Promoting galera:1 (Slave pcmk-controller3)
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:     info: master_color:	galera-master: Promoted 3 instances of a possible 3 to master
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:     info: RecurringOp:	 Start recurring monitor (10s) for galera:1 on pcmk-controller3
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:     info: RecurringOp:	 Start recurring monitor (10s) for galera:1 on pcmk-controller3
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:     info: LogActions:	Leave   ip-galera-pub-192.168.242.32	(Started pcmk-controller3)
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:     info: LogActions:	Leave   galera:0	(Master unmanaged)
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:   notice: LogActions:	Recover galera:1	(Slave pcmk-controller3)
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:   notice: LogActions:	Promote galera:1	(Slave -> Master pcmk-controller3)
Sep 09 15:08:16 [16391] controller2.sparklab    pengine:     info: LogActions:	Leave   galera:2	(Master unmanaged)

ii. Managing the galera / galera-master resource did not change the status of the galera resource on controller2 (the commands are sketched after step viii).

iii. We tried "pcs resource clear galera pcmk-controller2", "pcs resource clear galera-master pcmk-controller2", "pcs resource cleanup" & "pcs resource restart galera", but this did not start the resource on controller2.

iv. We put the cluster into standby & unstandby (pcs cluster standby --all / pcs cluster unstandby --all).
This did not bring up galera on any node.

v. We debug-started the galera resource.
This started galera on controller1 & controller3; they are tagged as Masters in pcs status.

vi. We ran "pcs resource restart galera" on controller2.

vii. We disabled / enabled the galera resource:
# pcs resource disable galera
# pcs resource enable galera

viii. After a resource cleanup, the resource came up on controller2.
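
For reference, the "manage" attempts in (ii) were along these lines; the meta-attribute form is the usual equivalent way to clear an unmanaged flag and is included only as a sketch (we have not verified it behaves any differently here):
# pcs resource manage galera
# pcs resource manage galera-master
# pcs resource meta galera-master is-managed=true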

Expected results:
pacemaker is expected to recover the resource once the network is up & the resource is cleaned up.

Additional info:
The issue observed appears to be related to pacemaker behaviour rather than a mariadb / galera problem, as no errors were logged by mariadb / galera.

Comment 3 Andrew Beekhof 2015-09-24 00:20:25 UTC
> Resource clear / cleanup / restart did not fix this issue.

These commands only do something useful if the underlying problem was either fixed or transient in nature.
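
In other words, cleanup only erases the stored operation failures and failcounts so the cluster re-probes the resource; it does not repair anything underneath it. A minimal sketch, using the resource name from this report:
# pcs resource failcount show galera
# pcs resource cleanup galera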

It looks like either the network or corosync had a full meltdown; no node was able to see any of its peers:

Sep 09 09:35:48 [16343] controller3.sparklab       crmd:   notice: crm_update_peer_state:	pcmk_quorum_notification: Node pcmk-controller1[1] - state is now member (was lost)
Sep 09 09:35:48 [16343] controller3.sparklab       crmd:   notice: crm_update_peer_state:	pcmk_quorum_notification: Node pcmk-controller2[2] - state is now member (was lost)

Sep 09 09:35:48 [16373] controller1.sparklab       crmd:   notice: crm_update_peer_state:	pcmk_quorum_notification: Node pcmk-controller2[2] - state is now member (was lost)
Sep 09 09:35:48 [16373] controller1.sparklab       crmd:   notice: crm_update_peer_state:	pcmk_quorum_notification: Node pcmk-controller3[3] - state is now member (was lost)

Sep 09 09:35:48 [16392] controller2.sparklab       crmd:   notice: crm_update_peer_state:	pcmk_quorum_notification: Node pcmk-controller1[1] - state is now member (was lost)
Sep 09 09:35:48 [16392] controller2.sparklab       crmd:   notice: crm_update_peer_state:	pcmk_quorum_notification: Node pcmk-controller3[3] - state is now member (was lost)

That plus the lack of fencing explains:

Sep 09 09:35:21 [16391] controller2.sparklab    pengine:     info: LogActions:	Leave   galera:1	(Stopped)
Sep 09 09:35:21 [16391] controller2.sparklab    pengine:     info: LogActions:	Leave   galera:2	(Stopped)
Sep 09 09:35:21 [16391] controller2.sparklab    pengine:   notice: LogActions:	Stop    galera:0	(Master pcmk-controller2)

No wonder galera got messed up and wouldn't restart:

Sep 09 09:49:40 [16370] controller1.sparklab       lrmd:   notice: operation_finished:	galera_promote_0:21691:stderr [ ocf-exit-reason:MySQL server failed to start (pid=21802) (rc=0), please check your installation ]
Sep 09 09:49:40 [16389] controller2.sparklab       lrmd:   notice: operation_finished:	galera_promote_0:32213:stderr [ ocf-exit-reason:MySQL server failed to start (pid=32331) (rc=0), please check your installation ]
Sep 09 13:23:13 [16340] controller3.sparklab       lrmd:   notice: operation_finished:	galera_monitor_10000:27507:stderr [ ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) ]
Sep 09 13:23:13 [16340] controller3.sparklab       lrmd:   notice: operation_finished:	galera_monitor_10000:27507:stderr [ ocf-exit-reason:local node <pcmk-controller3> is started, but not in primary mode. Unknown state. ]
Sep 09 13:23:13 [16340] controller3.sparklab       lrmd:   notice: operation_finished:	galera_monitor_10000:27507:stderr [ ocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'monitor_user' has permissions to view status ]
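
The same check the monitor runs can be repeated by hand, independent of pacemaker; a minimal sketch, assuming the monitor_user credentials referenced in the last line above:
# mysql -u monitor_user -p -e "SHOW STATUS LIKE 'wsrep_cluster_status';"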

Closing since there was no fencing configured (required for support).
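
For reference, enabling fencing amounts to defining a stonith device per node and turning stonith back on; the agent choice and connection parameters below are placeholders that depend entirely on the hardware:
# pcs stonith create fence-controller1 fence_ipmilan pcmk_host_list="pcmk-controller1" ipaddr="<bmc-ip>" login="<bmc-user>" passwd="<bmc-password>" lanplus=1
# pcs property set stonith-enabled=true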