Bug 1248132

Summary: rabbitmq-server resource not launching on node after failures.
Product: Red Hat Enterprise Linux 7
Reporter: Lee Yarwood <lyarwood>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED CURRENTRELEASE
QA Contact: Asaf Hirshberg <ahirshbe>
Severity: urgent
Priority: unspecified
Version: 7.1
CC: abeekhof, cluster-maint, djansa, fdinitto, jruemker, oblaut, sputhenp
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: pacemaker-1.1.13-8.el7
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-11-08 08:37:43 UTC
Attachments:
    crm_resource -C -r rabbitmq-clone -VVVVV output
    using crm_resource -C -r rabbitmq-clone -VVVVV with pacemaker-libs-1.1.13-8.el7.x86_64

Description Lee Yarwood 2015-07-29 16:40:04 UTC
Description of problem:
The rabbitmq-server resource does not launch on a node after previous failures. These failures were caused by the rabbitmq-server systemd service being left enabled when the host was rebooted.

Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.2.x86_64

How reproducible:
Unclear.

Steps to Reproduce:
Unclear.

Actual results:
It appears that the resource is not even asked to start on the node in question by pacemaker. The following is seen in pacemaker.log even after `pcs resource cleanup abbitmq-server-clone` is called:

Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
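
(The accumulated failcount behind these messages can be confirmed directly; a minimal check, reusing the pcs failcount command that appears later in this report and assuming the resource name used in this cluster:)

    pcs resource failcount show rabbitmq-server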

Expected results:
The resource is cleaned up and able to start on a node after previous failures.

Additional info:

Comment 3 Andrew Beekhof 2015-07-29 20:09:09 UTC
Ban + clear won't achieve anything here.

Comment 4 Andrew Beekhof 2015-07-29 20:36:01 UTC
So it's not being started due to:

          <nvpair id="status-3-last-failure-rabbitmq-server" name="last-failure-rabbitmq-server" value="1438107693"/>
          <nvpair id="status-3-fail-count-rabbitmq-server" name="fail-count-rabbitmq-server" value="INFINITY"/>

Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is this the actual command that was run?

Comment 5 Lee Yarwood 2015-07-29 21:48:42 UTC
(In reply to Andrew Beekhof from comment #4)
> So it's not being started due to:
> 
>           <nvpair id="status-3-last-failure-rabbitmq-server"
> name="last-failure-rabbitmq-server" value="1438107693"/>
>           <nvpair id="status-3-fail-count-rabbitmq-server"
> name="fail-count-rabbitmq-server" value="INFINITY"/>
> 
> Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is
> this the actual command that was run?

Just a typo, apologies.

Comment 6 Andrew Beekhof 2015-07-30 02:32:58 UTC
Could you try:

   crm_resource -C -r rabbitmq-server

and/or 

   `pcs resource cleanup rabbitmq-server`

then report the result of:

    cibadmin -Q | grep fail-count-rabbitmq-server

Comment 7 Andrew Beekhof 2015-07-30 02:36:29 UTC
Some background... there was a time when pacemaker didn't automatically clean up fail-counts - but 7.1 should have that code.

Also, some versions of pcs and crm_resource didn't behave consistently depending on whether the supplied resource name did or did not include the "-clone" suffix.
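
(If the automatic cleanup mentioned above refers to failure-timeout based expiry of fail-counts, a minimal sketch of enabling it with pcs, using an example timeout of 300s on this clone, would be:)

    pcs resource meta rabbitmq-server-clone failure-timeout=300s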

Comment 8 Lee Yarwood 2015-07-30 09:00:23 UTC
`crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and allowed it to run again on the host. `cibadmin -Q | grep fail-count-rabbitmq-server` then reported no output.

I forgot to list that we had already attempted to run `pcs resource cleanup rabbitmq-server` a number of times without success.

Andrew, I'll leave the next steps in the bug to you as I'm not sure if there is anything to resolve here other than documenting this corner case.

Comment 9 Perry Myers 2015-07-30 11:41:37 UTC
@beekhof, this seems like a common case... someone puts a resource under pacemaker control but forgets to disable the respective systemd service.

Is there anything we can do proactively to prevent this error from occurring? i.e. could Pacemaker periodically check for systemd units that are enabled but should not be, and either disable them or warn the user very vocally that those units should not be enabled while the service is under pacemaker control?

I think the workaround described above for when/if this does happen is fine, but I'd like to make sure that we also prevent it from happening in the first place.
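
(A minimal version of the proactive check suggested here, run manually and assuming the standard unit name, would be:)

    systemctl is-enabled rabbitmq-server    # should report "disabled" when pacemaker owns the service
    systemctl disable rabbitmq-server       # ensure systemd does not start it behind pacemaker's back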

Comment 10 Andrew Beekhof 2015-07-30 23:07:06 UTC
Yes, there is a plan involving the systemd provider API which looks relevant.

It is, however, a non-trivial task without much precedent, and details on the API are sketchy, so it will likely take someone hiding in a cave for a few weeks to get it done.

Hence why it hasn't happened yet.

Comment 11 Andrew Beekhof 2015-07-30 23:53:16 UTC
(In reply to Lee Yarwood from comment #8)
> `crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and
> allowed it to run again on the host. `cibadmin -Q | grep
> fail-count-rabbitmq-server` then reported no output.
> 
> I forgot to list that we had already attempted to run `pcs resource cleanup
> rabbitmq-server` a number of times without success.

That is very weird because looking at the pcs sources, 

    pcs resource cleanup rabbitmq-server

executes:

    crm_resource -C -r rabbitmq-server

> Andrew, I'll leave the next steps in the bug to you as I'm not sure if there
> is anything to resolve here other than documenting this corner case.

Next step here is to check that `crm_resource -C -r XYZ` and `crm_resource -C -r XYZ-clone` both correctly clear the relevant failcount.

After that, I may bump it over to the pcs guys to follow up on their side.
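
(A minimal sketch of that check, reusing the commands from this thread and assuming a fail-count is present beforehand:)

    # clear by primitive name, then confirm the attribute is gone
    crm_resource -C -r rabbitmq-server
    cibadmin -Q | grep fail-count-rabbitmq-server

    # repeat with the clone name
    crm_resource -C -r rabbitmq-server-clone
    cibadmin -Q | grep fail-count-rabbitmq-server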

Comment 12 Andrew Beekhof 2015-08-03 01:49:35 UTC
This patch should do it:

https://github.com/beekhof/pacemaker/commit/181272b

Comment 15 Asaf Hirshberg 2015-10-07 14:32:14 UTC
Created attachment 1080695 [details]
crm_resource -C -r rabbitmq-clone -VVVVV output

Comment 16 Asaf Hirshberg 2015-10-07 14:37:54 UTC
Verified:
Using: 7.0-RHEL-7-director/2015-10-01.1
       resource-agents-3.9.5-52.el7.x86_64 # wasn't included in the puddle #
       pacemaker-1.1.12-22.el7_1.4.x86_64
       rabbitmq-server-3.3.5-5.el7ost.noarch


[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-1: 2
[root@overcloud-controller-1 ~]# crm_resource -C -r rabbitmq-clone 
Cleaning up rabbitmq:0 on overcloud-controller-0
Cleaning up rabbitmq:0 on overcloud-controller-1
Cleaning up rabbitmq:0 on overcloud-controller-2
Cleaning up rabbitmq:1 on overcloud-controller-0
Cleaning up rabbitmq:1 on overcloud-controller-1
Cleaning up rabbitmq:1 on overcloud-controller-2
Cleaning up rabbitmq:2 on overcloud-controller-0
Cleaning up rabbitmq:2 on overcloud-controller-1
Cleaning up rabbitmq:2 on overcloud-controller-2
Waiting for 9 replies from the CRMd......... OK
[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
No failcounts for rabbitmq

Full output of "crm_resource -C -r rabbitmq-clone -VVVVV" is attached above.

Steps: 1) kill -9 <rabbitmq PID>
       2) pcs resource failcount show rabbitmq
       3) crm_resource -C -r rabbitmq-clone

Comment 17 Asaf Hirshberg 2015-10-08 06:52:04 UTC
Fix: I forgot to update to pacemaker-1.1.13-8.el7.x86_64.

Verified, new results:

[root@overcloud-controller-0 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-0: 2
[root@overcloud-controller-0 ~]# crm_resource -C -r rabbitmq-clone
Cleaning up rabbitmq:0 on overcloud-controller-0, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-1, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-2, removing fail-count-rabbitmq
Waiting for 3 replies from the CRMd... OK
[root@overcloud-controller-0 ~]# 

Full output of "crm_resource -C -r rabbitmq-clone -VVVVV" is attached above.

Comment 18 Asaf Hirshberg 2015-10-08 06:53:03 UTC
Created attachment 1080863 [details]
using crm_resource -C -r rabbitmq-clone -VVVVV with pacemaker-libs-1.1.13-8.el7.x86_64