Bug 1200756 - Disabled resources tend to trigger fencing of all cluster nodes
Summary: Disabled resources tend to trigger fencing of all cluster nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Fabio Massimo Di Nitto
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-03-11 10:55 UTC by Radek Steiger
Modified: 2015-11-19 04:46 UTC
CC List: 6 users

Fixed In Version: resource-agents-3.9.5-44.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-19 04:46:58 UTC
Target Upstream Version:
Embargoed:


Attachments: (none)


Links
System: Red Hat Product Errata
ID: RHBA-2015:2190
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: resource-agents bug fix and enhancement update
Last Updated: 2015-11-19 08:06:48 UTC

Description Radek Steiger 2015-03-11 10:55:22 UTC
> Description of problem:

The monitor function performed automatically on every existing (even disabled) resource seems to be too aggressive in terms of the outcome for the cluster. One good example of this behavior is a simple typo in one of the IPsrcaddr resource's options.

I created a resource called MyTypo and intentionally specified an invalid IP address of '192.168.1.411', obviously meant as '192.168.1.41', but making a typo like this is not that uncommon. The important part here is the --disabled flag, which ensures the resource is never actually started (as pcs adds the relevant meta-attribute in the same step as creating the resource itself):

[root@virt-042 ~]# pcs resource create MyTypo IPsrcaddr ipaddress=192.168.1.411 --disabled
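(For reference, --disabled here just sets the target-role=Stopped meta attribute at creation time, so the command above is roughly equivalent to:

[root@virt-042 ~]# pcs resource create MyTypo IPsrcaddr ipaddress=192.168.1.411 meta target-role=Stopped

i.e. the resource exists in the CIB but is never supposed to start.)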

One would expect the resource to just sit there, but instead the cluster goes on a fencing rampage, killing nodes one by one. The only node spared was virt-042, which happened not to have the 'ifconfig' command installed; that appears to be the reason it survived. This is how the cluster status looked from virt-042 shortly after adding the MyTypo resource:

[root@virt-042 ~]# pcs status
Cluster name: r7cluster
Last updated: Wed Mar 11 11:05:37 2015
Last change: Wed Mar 11 11:05:24 2015
Stack: corosync
Current DC: virt-044 (4) - partition WITHOUT quorum
Version: 1.1.12-a14efad
4 Nodes configured
2 Resources configured

Node virt-044 (4): UNCLEAN (online)
Online: [ virt-042 ]
OFFLINE: [ virt-041 virt-043 ]

Full list of resources:
 Fencing	(stonith:fence_xvm):	Stopped 
 MyTypo	(ocf::heartbeat:IPsrcaddr):	FAILED virt-044 

Failed actions:
    MyTypo_stop_0 on virt-044 'not installed' (5): call=10, status=complete, exit-reason='We are not serving [192.168.1.411], hence can not make it a preferred source address', last-rc-change='Wed Mar 11 11:05:25 2015', queued=0ms, exec=95ms
    MyTypo_monitor_0 on virt-042 'not installed' (5): call=9, status=complete, exit-reason='Setup problem: couldn't find command: ifconfig', last-rc-change='Wed Mar 11 11:05:24 2015', queued=0ms, exec=35ms


I'm not arguing here with the reasons why every existing resource is constantly monitored regardless of its target status as set by the admin, but having a resource cause a killing spree due to a typo, when the resource was never actually meant to be started (target-role=Stopped right from the beginning), is not something I see as reasonable behavior.


> Version-Release number of selected component (if applicable):

[root@virt-042 ~]# rpm -q pacemaker resource-agents pcs
pacemaker-1.1.12-22.el7.x86_64
resource-agents-3.9.5-40.el7.x86_64
pcs-0.9.137-13.el7.x86_64


> How reproducible:

Always


> Steps to Reproduce:

1. have a cluster with proper fencing set up
2. add a resource which is both disabled and whose monitor operation will fail (see the example commands below)
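For example, on a cluster like the one above (node names and the fence_xvm stonith device are taken from the status output in the description; any disabled resource whose probe or stop fails should do):

[root@virt-042 ~]# pcs resource create MyTypo IPsrcaddr ipaddress=192.168.1.411 --disabled
[root@virt-042 ~]# pcs status

Within a few seconds the probes fail on every node and the failed stop escalates to fencing.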


> Actual results:

Killing spree!


> Expected results:

No fencing, and ideally no failed monitor operation at all.

Comment 2 Andrew Beekhof 2015-03-23 23:21:04 UTC
> The monitor function performed automatically on every existing (even disabled) resource seems to be too aggressive in terms of the outcome for the cluster.

One can argue about whether the results should be highlighted, but the operations are absolutely essential to preserving data integrity.

Just because a resource is disabled doesn't mean it's not running on a given node.

It could have been started automatically at boot, it could still be running there after a pacemaker crash, or the update that set it disabled may also have erased the cluster's memory of what was running on that node.  There are many possibilities.

Comment 3 Andrew Beekhof 2015-03-24 00:34:48 UTC
> The monitor function performed automatically on every existing (even disabled) resource seems to be too aggressive in terms of the outcome for the cluster.

One can argue about whether the results should be highlighted, but the operations are absolutely essential to preserving data integrity.

Just because a resource is disabled doesn't mean it's not running on a given node.

It could have been started automatically at boot, it could still be running there after a pacemaker crash, or the update that set it disabled may also have erased the cluster's memory of what was running on that node.  There are many possibilities.


As to why the nodes are fenced, that's because the stop action fails.
Looks like the agent is simply broken (stopping a stopped resource must not be an error).

David: See https://github.com/ClusterLabs/resource-agents/commit/2db67faacac5435c6c0084dc85c4454fda40ce04#commitcomment-10357514

Bouncing to the RA package to fix
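For background, the OCF convention being violated is that a 'stop' of a resource that is not running (including one that cannot run because its parameters are invalid) must return success; only a genuinely failed teardown should report an error. A minimal sketch of that pattern in an agent's stop path (simplified, with an illustrative is_running helper; this is not the actual IPsrcaddr code):

myagent_stop() {
    if ! myagent_is_running; then
        # Resource is already stopped, or was never started (e.g. invalid
        # configuration): returning success here keeps pacemaker from treating
        # this as a failed stop and escalating to fencing.
        return $OCF_SUCCESS
    fi
    # ... actually tear the resource down here ...
    return $OCF_SUCCESS
}

($OCF_SUCCESS comes from the ocf-shellfuncs include that every heartbeat agent sources.)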

Comment 4 David Vossel 2015-03-24 15:01:03 UTC
(In reply to Andrew Beekhof from comment #3)
> As to why the nodes are fenced, thats because the stop action fails.
> Looks like the agent is simply broken (stopping a stopped resource must not
> be an error).
> 
> David: See
> https://github.com/ClusterLabs/resource-agents/commit/
> 2db67faacac5435c6c0084dc85c4454fda40ce04#commitcomment-10357514
> 
> Bouncing to the RA package to fix

Ah yes, I see the issue. We need to return the correct error code on an invalid configuration; otherwise, to pacemaker it looks like we are failing to stop the resource (which causes the node to be fenced).

-- David
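To spell out the failure mode: the agent's stop action was exiting with the "not installed" code (the '(5)' visible in the failed actions in the description), and pacemaker treats any non-success exit from a stop as a failed stop, which it recovers from by fencing the node. The relevant OCF return codes, as defined in the ocf-returncodes include shipped with resource-agents, are:

OCF_SUCCESS=0          # what stop must return for a resource that is not running
OCF_ERR_GENERIC=1
OCF_ERR_INSTALLED=5    # "not installed", what the broken agent returned from stop
OCF_NOT_RUNNING=7      # what a probe/monitor returns for a cleanly stopped resource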

Comment 5 David Vossel 2015-04-28 17:27:04 UTC
patches
https://github.com/ClusterLabs/resource-agents/pull/605

Unit test.

# export OCF_ROOT=/usr/lib/ocf/ OCF_RESKEY_ipaddress=1.168.122.41
# /usr/lib/ocf/resource.d/heartbeat/IPsrcaddr stop
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
# echo $?
0

Verify that 'stop' returns 0 when the ipaddress is misconfigured.

Comment 7 michal novacek 2015-08-17 12:59:20 UTC
I have verified that the stop action on a non-managed IP address now correctly returns zero instead of non-zero with the IPsrcaddr RA in resource-agents-3.9.5-52.el7.x86_64.

----

[root@virt-068 ~]# export OCF_ROOT=/usr/lib/ocf/ OCF_RESKEY_ipaddress=1.168.122.41

AFTER THE FIX (resource-agents-3.9.5-52.el7.x86_64)
---------------------------------------------------
[root@virt-068 ~]# /usr/lib/ocf/resource.d/heartbeat/IPsrcaddr stop
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
[root@virt-068 ~]# echo $?
0

BEFORE THE FIX (resource-agents-3.9.5-40.el7.x86_64)
---------------------------------------------------
[root@virt-068 ~]# /usr/lib/ocf/resource.d/heartbeat/IPsrcaddr stop
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
[root@virt-068 ~]# echo $?
5

Comment 10 errata-xmlrpc 2015-11-19 04:46:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2190.html

