Bug 789397

Summary: Failcount and related info should be reset or removed when the resource is deleted
Product: Red Hat Enterprise Linux 6
Reporter: Jaroslav Kortus <jkortus>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: low
Priority: low
Version: 6.2
CC: cluster-maint, dvossel
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pacemaker-1.1.7-6.el6
Doc Type: Bug Fix
Doc Text:
Cause: Untested use case.
Consequence: Records of previous failures of now-deleted resources were preserved.
Fix: Implement new feature.
Last Closed: 2012-06-20 13:48:47 UTC

Description Jaroslav Kortus 2012-02-10 17:00:43 UTC
Description of problem:
The failcount persists across resource removal and redefinition, causing undesired behaviour.

Let's say I set up an apache resource without having apache installed. The agent is started and the failcount soon reaches INFINITY, effectively disabling the resource.

If I remove the resource and redefine it, its failcount should initially be 0.

Version-Release number of selected component (if applicable):
pacemaker-1.1.6-3.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
0. check that you don't have httpd installed
1. crm configure primitive webserver ocf:heartbeat:apache params configfile="/etc/httpd/conf/httpd.conf"
2. crm resource failcount webserver show node01
3. crm configure delete webserver
4. repeat step 2
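
The steps above can be sketched as a small script. This is only a sketch: the `run` and `reproduce` names and the `DRY_RUN` switch are hypothetical helpers, not part of crm; the crm commands themselves are the ones listed in the steps. It does not wait for the failcount to climb, so on a real cluster a pause before the first failcount query may be needed.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical helper so the sequence can be inspected without a cluster.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reproduce NODE: steps 1-4 from this report, in order.
reproduce() {
    node="$1"
    # Step 1: define the resource (httpd is assumed NOT to be installed).
    run crm configure primitive webserver ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf"
    # Step 2: show the failcount on the given node.
    run crm resource failcount webserver show "$node"
    # Step 3: delete the resource.
    run crm configure delete webserver
    # Step 4: show the failcount again; the bug is that it is still INFINITY.
    run crm resource failcount webserver show "$node"
}

# Print the command sequence only; drop DRY_RUN=1 to run it on a test cluster.
DRY_RUN=1 reproduce node01
```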
  
Actual results:
step 4:
$ crm resource failcount webserver show node01
scope=status  name=fail-count-webserver value=INFINITY

The redefined resource will not start until crm resource cleanup webserver is called.
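
For scripted checks, the value can be pulled out of the failcount line shown in step 4. A minimal sketch, assuming the output keeps the `scope=... name=... value=...` layout pasted above; `failcount_value` is a hypothetical helper name, not a crm command.

```shell
# failcount_value: read `crm resource failcount <rsc> show <node>` output
# on stdin and print only the text after "value=".
failcount_value() {
    sed -n 's/.*value=\([^ ]*\).*/\1/p'
}

# Example using the line from step 4 (no cluster needed):
echo 'scope=status  name=fail-count-webserver value=INFINITY' | failcount_value
```

On a live cluster the same helper could be fed directly, for example `crm resource failcount webserver show node01 | failcount_value`.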

Expected results:
scope=status  name=fail-count-webserver value=0
(It seems unknown resources are reported with a failcount of 0; is that intentional?)
When a failed resource is deleted, all related information must be removed as well, so that the resource can be redefined and started from scratch.

Additional info:

Comment 3 Andrew Beekhof 2012-03-01 01:16:33 UTC
A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/dbf1a62
  Medium: PE: Bug rhbz#789397 - Failcount and related info should be reset or removed when the resource is deleted

Comment 4 Andrew Beekhof 2012-03-01 01:17:54 UTC
A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/c26e624
  Low: PE: Bug rhbz#789397 - Failcount and related info should be reset or removed when the resource is deleted (regression test)

Comment 6 Jaroslav Kortus 2012-04-05 15:07:43 UTC
# make sure httpd is not installed

# crm configure primitive webserver ocf:heartbeat:apache params configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s
$ crm_mon -1 --inactive
============
Last updated: Thu Apr  5 10:01:30 2012
Last change: Thu Apr  5 10:01:24 2012 via cibadmin on m3c1-node01
Stack: cman
Current DC: m3c1-node01 - partition with quorum
Version: 1.1.7-5.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ m3c1-node03 m3c1-node01 m3c1-node02 ]

Full list of resources:

 virt-fencing   (stonith:fence_xvm):    Started m3c1-node01
 webserver      (ocf::heartbeat:apache):        Stopped 

Failed actions:
    webserver_start_0 (node=m3c1-node03, call=4, rc=5, status=complete): not installed
    webserver_start_0 (node=m3c1-node02, call=4, rc=5, status=complete): not installed
    webserver_start_0 (node=m3c1-node01, call=5, rc=5, status=complete): not installed

# failed as expected

# for all nodes the same:
$ crm resource failcount webserver show m3c1-node02
scope=status  name=fail-count-webserver value=0

# so far so good
# crm resource stop webserver; crm configure delete webserver

# install httpd (yum -y install httpd)
# crm configure primitive webserver ocf:heartbeat:apache params configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s

And the result is the same. Note that the logs report a failure that is no longer true (httpd is now installed):
 pengine[10316]:   notice: unpack_rsc_op: Preventing webserver from re-starting on m3c1-node03: operation start failed 'not installed' (rc=5)

So there are probably still some traces of the previous failure.
crm resource cleanup webserver fixes it.
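
To find which resources and nodes are affected by such leftover records, the pengine notice quoted above can be filtered from the logs. A rough sketch, assuming the message wording stays exactly as in the line pasted in this comment; `stale_start_failures` is a hypothetical helper name.

```shell
# stale_start_failures: read log lines on stdin and print "resource node"
# for each "Preventing <rsc> from re-starting on <node>:" notice.
stale_start_failures() {
    sed -n 's/.*Preventing \([^ ]*\) from re-starting on \([^ :]*\):.*/\1 \2/p'
}

# Example using the log line from this comment:
echo " pengine[10316]:   notice: unpack_rsc_op: Preventing webserver from re-starting on m3c1-node03: operation start failed 'not installed' (rc=5)" \
    | stale_start_failures
```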

Moving back to ASSIGNED.

Comment 7 Jaroslav Kortus 2012-04-05 15:08:46 UTC
forgot to mention the version: pacemaker-1.1.7-5.el6.x86_64

Comment 8 David Vossel 2012-04-10 16:54:07 UTC
There was an issue fixed upstream recently that will resolve what you are running into.  The crm stores historical data about the last failure of a resource.  That data was not being cleared correctly, which made the failure appear to have returned.  As you noticed, before this patch the only way to completely remove that historical data was to issue a crm_resource -C command.

This upstream patch will fix this.
https://github.com/ClusterLabs/pacemaker/commit/2f970a1a2a7c50c25ff06f2a7f87d66438bb0afb
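
For builds without this patch, clearing the stored history by hand could be wrapped as below. This is a sketch, not the fix itself: `run`, `clear_history` and `DRY_RUN` are hypothetical names, and it assumes the `--cleanup` and `--resource` long options of crm_resource (the long forms of `-C` and `-r`).

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# clear_history RESOURCE: ask the cluster to drop the recorded operation
# history (including past failures) for one resource.
clear_history() {
    run crm_resource --cleanup --resource "$1"
}

# Print the command only; drop DRY_RUN=1 to run it on a cluster node.
DRY_RUN=1 clear_history webserver
```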

Comment 10 Jaroslav Kortus 2012-04-19 13:05:03 UTC
works as expected with pacemaker-1.1.7-6.el6.x86_64

thank you :)

Comment 11 Andrew Beekhof 2012-05-08 11:35:45 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Untested use case.
Consequence: Records of previous failures of now-deleted resources were preserved.
Fix: Implement new feature.

Comment 13 errata-xmlrpc 2012-06-20 13:48:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0846.html