Bug 1200756

Summary: Disabled resources tend to trigger fencing of all cluster nodes
Product: Red Hat Enterprise Linux 7
Component: resource-agents
Version: 7.1
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Radek Steiger <rsteiger>
Assignee: Fabio Massimo Di Nitto <fdinitto>
QA Contact: cluster-qe <cluster-qe>
CC: agk, cluster-maint, fdinitto, hellya.wang, jkortus, mnovacek
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: resource-agents-3.9.5-44.el7
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-11-19 04:46:58 UTC

Description Radek Steiger 2015-03-11 10:55:22 UTC
> Description of problem:

The monitor function performed automatically on every existing (even disabled) resource seems to be too aggressive in terms of its outcome for the cluster. A good example of this behavior is a simple typo in one of the IPsrcaddr resource's options.

I created a resource called MyTypo and intentionally specified an invalid IP address of '192.168.1.411', obviously meant as '192.168.1.41'; a typo like this is not that uncommon. The important part here is the --disabled flag, which ensures the resource is never actually started (pcs adds the relevant meta-attribute in the same step as creating the resource itself):

[root@virt-042 ~]# pcs resource create MyTypo IPsrcaddr ipaddress=192.168.1.411 --disabled

One would expect the resource to just sit there, but instead the cluster goes on a fencing rampage, killing nodes one by one. The only lucky node here was virt-042, which happened not to have the 'ifconfig' command installed, which appears to be the reason it was spared. This is how the cluster status looked from virt-042 shortly after adding the MyTypo resource:

[root@virt-042 ~]# pcs status
Cluster name: r7cluster
Last updated: Wed Mar 11 11:05:37 2015
Last change: Wed Mar 11 11:05:24 2015
Stack: corosync
Current DC: virt-044 (4) - partition WITHOUT quorum
Version: 1.1.12-a14efad
4 Nodes configured
2 Resources configured

Node virt-044 (4): UNCLEAN (online)
Online: [ virt-042 ]
OFFLINE: [ virt-041 virt-043 ]

Full list of resources:
 Fencing	(stonith:fence_xvm):	Stopped 
 MyTypo	(ocf::heartbeat:IPsrcaddr):	FAILED virt-044 

Failed actions:
    MyTypo_stop_0 on virt-044 'not installed' (5): call=10, status=complete, exit-reason='We are not serving [192.168.1.411], hence can not make it a preferred source address', last-rc-change='Wed Mar 11 11:05:25 2015', queued=0ms, exec=95ms
    MyTypo_monitor_0 on virt-042 'not installed' (5): call=9, status=complete, exit-reason='Setup problem: couldn't find command: ifconfig', last-rc-change='Wed Mar 11 11:05:24 2015', queued=0ms, exec=35ms


I'm not arguing with the reasons why every existing resource is constantly monitored regardless of the target status set by the admin, but a resource causing a killing spree due to a typo, when the resource was never actually meant to be started (target-role=Stopped right from the beginning), is not something I see as reasonable behavior.


> Version-Release number of selected component (if applicable):

[root@virt-042 ~]# rpm -q pacemaker resource-agents pcs
pacemaker-1.1.12-22.el7.x86_64
resource-agents-3.9.5-40.el7.x86_64
pcs-0.9.137-13.el7.x86_64


> How reproducible:

Always


> Steps to Reproduce:

1. Have a cluster with proper fencing set up.
2. Add a resource that is both disabled and will fail its monitor operation (see the sketch below).
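
For reference, a minimal reproduction sketch built from the commands already shown in this report; the stonith device name and the bare fence_xvm invocation are assumptions taken from the status output above (a real fence_xvm device needs additional options):

[root@virt-042 ~]# pcs stonith create Fencing fence_xvm
[root@virt-042 ~]# pcs resource create MyTypo IPsrcaddr ipaddress=192.168.1.411 --disabled
[root@virt-042 ~]# pcs status

Even with target-role=Stopped, the one-time probe (MyTypo_monitor_0) still runs on every node, and its failure path leads to the failed stop and the fencing shown above.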


> Actual results:

Killing spree!


> Expected results:

No fencing, and ideally no failed monitor operation at all.

Comment 2 Andrew Beekhof 2015-03-23 23:21:04 UTC
> The monitor function performed automatically on every existing (even disabled) resource seems to be too aggressive in terms of its outcome for the cluster.

One can argue about whether the results should be highlighted, but the operations are absolutely essential to preserving data integrity.

Just because a resource is disabled doesn't mean it's not running on a given node.

It could have been started automatically at boot, it could still be running there after a Pacemaker crash, or the update that disabled it may also have erased the cluster's memory of what was running on that node. There are many possibilities.

Comment 3 Andrew Beekhof 2015-03-24 00:34:48 UTC
> The monitor function performed automatically on every existing (even disabled) resource seems to be too aggressive in terms of its outcome for the cluster.

One can argue about whether the results should be highlighted, but the operations are absolutely essential to preserving data integrity.

Just because a resource is disabled doesn't mean it's not running on a given node.

It could have been started automatically at boot, it could still be running there after a Pacemaker crash, or the update that disabled it may also have erased the cluster's memory of what was running on that node. There are many possibilities.


As to why the nodes are fenced, that's because the stop action fails.
It looks like the agent is simply broken (stopping a stopped resource must not be an error).
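
For illustration, a minimal sketch of the OCF convention in question (hypothetical shell, not the actual IPsrcaddr code; the helper names are made up): stop must report success when the resource is not running, even if the configuration turns out to be invalid.

srcaddr_stop() {
    # Hypothetical helper: checks whether the address is currently set as the
    # preferred source address on this node.
    if ! srcaddr_is_active; then
        # Nothing is running, so stopping must succeed (OCF_SUCCESS = 0);
        # otherwise Pacemaker escalates the failed stop to fencing.
        return $OCF_SUCCESS
    fi
    # Hypothetical helper: removes the preferred source address setting.
    remove_src_address || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}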

David: See https://github.com/ClusterLabs/resource-agents/commit/2db67faacac5435c6c0084dc85c4454fda40ce04#commitcomment-10357514

Bouncing to the RA package to fix

Comment 4 David Vossel 2015-03-24 15:01:03 UTC
(In reply to Andrew Beekhof from comment #3)
> As to why the nodes are fenced, thats because the stop action fails.
> Looks like the agent is simply broken (stopping a stopped resource must not
> be an error).
> 
> David: See
> https://github.com/ClusterLabs/resource-agents/commit/
> 2db67faacac5435c6c0084dc85c4454fda40ce04#commitcomment-10357514
> 
> Bouncing to the RA package to fix

Ah yes, I see the issue. We need to return the correct error code on invalid configuration; otherwise it looks to Pacemaker like we are failing to stop the resource (which causes the node to be fenced).

-- David
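
A rough sketch of the return-code pattern described above (hypothetical shell, not the merged patch; validate_ipaddress stands in for the agent's real checks): on a bad ipaddress, start and monitor report the configuration problem, but stop still exits 0 so Pacemaker does not treat it as a failed stop and fence the node.

ACTION=$1    # action requested by Pacemaker: start, stop, monitor, ...

if ! validate_ipaddress "$OCF_RESKEY_ipaddress"; then
    ocf_exit_reason "We are not serving [$OCF_RESKEY_ipaddress], hence can not make it a preferred source address"
    case $ACTION in
        stop) exit $OCF_SUCCESS ;;         # nothing was ever started, so stop succeeds
        *)    exit $OCF_ERR_CONFIGURED ;;  # start/monitor surface the misconfiguration
    esac
fi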

Comment 5 David Vossel 2015-04-28 17:27:04 UTC
Patches:
https://github.com/ClusterLabs/resource-agents/pull/605

Unit test.

# export OCF_ROOT=/usr/lib/ocf/ OCF_RESKEY_ipaddress=1.168.122.41
# /usr/lib/ocf/resource.d/heartbeat/IPsrcaddr stop
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
# echo $?
0

Verify that 'stop' returns 0 when the ipaddress is misconfigured.

Comment 7 michal novacek 2015-08-17 12:59:20 UTC
I have verified that the stop action on a non-managed IP address correctly returns
zero instead of non-zero with the IPsrcaddr RA in
resource-agents-3.9.5-52.el7.x86_64.

----

[root@virt-068 ~]# export OCF_ROOT=/usr/lib/ocf/ OCF_RESKEY_ipaddress=1.168.122.41

AFTER THE FIX (resource-agents-3.9.5-52.el7.x86_64)
---------------------------------------------------
[root@virt-068 ~]# /usr/lib/ocf/resource.d/heartbeat/IPsrcaddr stop
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
[root@virt-068 ~]# echo $?
0

BEFORE THE FIX (resource-agents-3.9.5-40.el7.x86_64)
---------------------------------------------------
[root@virt-068 ~]# /usr/lib/ocf/resource.d/heartbeat/IPsrcaddr stop
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
[root@virt-068 ~]# echo $?
5

Comment 10 errata-xmlrpc 2015-11-19 04:46:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2190.html