Bug 1200756
Summary: | Disabled resources tend to trigger fencing of all cluster nodes | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Radek Steiger <rsteiger> |
Component: | resource-agents | Assignee: | Fabio Massimo Di Nitto <fdinitto> |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | | |
Version: | 7.1 | CC: | agk, cluster-maint, fdinitto, hellya.wang, jkortus, mnovacek |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | resource-agents-3.9.5-44.el7 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2015-11-19 04:46:58 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Radek Steiger
2015-03-11 10:55:22 UTC
> The monitor function performed automatically on every existing (even disabled) resource seems to be too aggressive in terms of the outcome for the cluster.
One can argue about whether the results should be highlighted, but the operations are absolutely essential to preserving data integrity.
Just because a resource is disabled doesn't mean it's not running on a given node.
It could have been started automatically at boot, it could still be running there after a pacemaker crash, or the update that set it disabled may have also erased the cluster's memory of what was running on that node. There are many possibilities.
As to why the nodes are fenced, that's because the stop action fails.
Looks like the agent is simply broken (stopping a stopped resource must not be an error).

David: See
https://github.com/ClusterLabs/resource-agents/commit/2db67faacac5435c6c0084dc85c4454fda40ce04#commitcomment-10357514

Bouncing to the RA package to fix

(In reply to Andrew Beekhof from comment #3)
> As to why the nodes are fenced, thats because the stop action fails.
> Looks like the agent is simply broken (stopping a stopped resource must not
> be an error).
>
> David: See
> https://github.com/ClusterLabs/resource-agents/commit/2db67faacac5435c6c0084dc85c4454fda40ce04#commitcomment-10357514
>
> Bouncing to the RA package to fix

Ah yes, I see the issue. We need to return the correct error code on invalid configuration; otherwise, to pacemaker it looks like we are failing to stop the resource (which causes the node to be fenced).

-- David

patches
https://github.com/ClusterLabs/resource-agents/pull/605

Unit test.

# export OCF_ROOT=/usr/lib/ocf/ OCF_RESKEY_ipaddress=1.168.122.41
# /usr/lib/ocf/resource.d/heartbeat/IPsrcaddr stop
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
# echo $?
0

Verify that 'stop' returns 0 when the ipaddress is misconfigured.
I have verified that the stop action on a non-managed IP address now correctly returns zero instead of non-zero with the IPsrcaddr RA in resource-agents-3.9.5-52.el7.x86_64.

----
[root@virt-068 ~]# export OCF_ROOT=/usr/lib/ocf/ OCF_RESKEY_ipaddress=1.168.122.41

AFTER THE FIX (resource-agents-3.9.5-52.el7.x86_64)
---------------------------------------------------
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
[root@virt-068 ~]# echo $?
0

BEFORE THE FIX (resource-agents-3.9.5-40.el7.x86_64)
---------------------------------------------------
/usr/lib/ocf/resource.d/heartbeat/IPsrcaddr stop
ocf-exit-reason:We are not serving [1.168.122.41], hence can not make it a preferred source address
[root@virt-068 ~]# echo $?
5

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2190.html
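For reference, the exit codes in the verification above (0 after the fix, 5 before it) can be read against the standard OCF return code names. The helper below is a small sketch for decoding them. `ocf_rc_name` is a hypothetical function name and not part of the resource-agents package, and only codes 0 through 7 are covered.

```shell
#!/bin/sh
# Hypothetical helper: map a numeric OCF exit code to its conventional name.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        *) echo "UNKNOWN" ;;
    esac
}
```

Running the agent and then `ocf_rc_name $?` would print `OCF_ERR_INSTALLED` for the pre-fix result of 5 and `OCF_SUCCESS` for the post-fix result of 0.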