Description of problem:
failcount persists across resource removal and redefinition, causing undesired behaviour. Say I set up an apache resource without having apache installed. The agent is started and the failcount soon reaches INFINITY, effectively disabling the resource. If I remove the resource and redefine it, its failcount should initially be 0.

Version-Release number of selected component (if applicable):
pacemaker-1.1.6-3.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
0. check that you don't have httpd installed
1. crm configure primitive webserver ocf:heartbeat:apache params configfile="/etc/httpd/conf/httpd.conf"
2. crm resource failcount webserver show node01
3. crm configure delete webserver
4. repeat step 2

Actual results:
step 4:
$ crm resource failcount webserver show node01
scope=status  name=fail-count-webserver value=INFINITY
The redefined resource will not start until crm resource cleanup webserver is called.

Expected results:
scope=status  name=fail-count-webserver value=0
(it seems unknown resources are scored as 0; is that intentional?)
When a failed resource is deleted, all related info must be gone as well, allowing redefinition and a start from scratch.

Additional info:
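The output above (scope=status) shows the stale value is stored as a transient node attribute in the status section, so it can also be inspected and removed by hand as a workaround. A sketch only, reusing the attribute and node names from above:

$ crm_attribute --type status --node node01 --name fail-count-webserver --query
# deleting the attribute by hand resets the count for that node:
$ crm_attribute --type status --node node01 --name fail-count-webserver --delete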
A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/dbf1a62
Medium: PE: Bug rhbz#789397 - Failcount and related info should be reset or removed when the resource is deleted
A related patch has been committed upstream: https://github.com/beekhof/pacemaker/commit/c26e624
Low: PE: Bug rhbz#789397 - Failcount and related info should be reset or removed when the resource is deleted (regression test)
# make sure httpd is not installed
# crm configure primitive webserver ocf:heartbeat:apache params configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s

$ crm_mon -1 --inactive
============
Last updated: Thu Apr  5 10:01:30 2012
Last change: Thu Apr  5 10:01:24 2012 via cibadmin on m3c1-node01
Stack: cman
Current DC: m3c1-node01 - partition with quorum
Version: 1.1.7-5.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ m3c1-node03 m3c1-node01 m3c1-node02 ]

Full list of resources:

 virt-fencing   (stonith:fence_xvm):    Started m3c1-node01
 webserver      (ocf::heartbeat:apache):        Stopped

Failed actions:
    webserver_start_0 (node=m3c1-node03, call=4, rc=5, status=complete): not installed
    webserver_start_0 (node=m3c1-node02, call=4, rc=5, status=complete): not installed
    webserver_start_0 (node=m3c1-node01, call=5, rc=5, status=complete): not installed

# failed as expected
# for all nodes the same:
$ crm resource failcount webserver show m3c1-node02
scope=status  name=fail-count-webserver value=0
# so far so good

# crm resource stop webserver; crm configure delete webserver
# install httpd (yum -y install httpd)
# crm configure primitive webserver ocf:heartbeat:apache params configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s

And the result is the same. Note from the logs that 'not installed' is no longer true, since httpd is now present:

pengine[10316]:   notice: unpack_rsc_op: Preventing webserver from re-starting on m3c1-node03: operation start failed 'not installed' (rc=5)

So there are probably still some traces of the previous failure left. crm resource cleanup webserver fixes it.

Moving back to ASSIGNED.
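For the record, the leftover traces are visible directly in the CIB: the old start failures stay recorded as operation history in the status section. A rough, unverified way to check, using the resource name from above:

$ cibadmin -Q -o status | grep webserver
# stale lrm_rsc_op entries such as webserver_start_0 remain listed here until a cleanup is run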
Forgot to mention the version: pacemaker-1.1.7-5.el6.x86_64
There was an issue fixed upstream recently that will resolve what you are running into. The crm stores historical data about the last failure of a resource. That data was not being cleared correctly, making it look as if the failure had returned. As you noticed, before this patch the only way to completely remove that historical data was to issue a crm_resource -C command. This upstream patch will fix this. https://github.com/ClusterLabs/pacemaker/commit/2f970a1a2a7c50c25ff06f2a7f87d66438bb0afb
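For anyone hitting this on builds without the patch, the explicit cleanup that clears the stored failure history looks like this (resource name taken from this report):

$ crm_resource --cleanup --resource webserver
# or the short form:
$ crm_resource -C -r webserver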
Works as expected with pacemaker-1.1.7-6.el6.x86_64, thank you :)
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Untested use case.
Consequence: Records of previous failures of now-deleted resources were preserved.
Fix: Failcount and related failure records are now reset or removed when the resource is deleted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0846.html