Bug 1248132 - rabbitmq-server resource not launching on node after failures.
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Assigned To: Andrew Beekhof
QA Contact: Asaf Hirshberg
Reported: 2015-07-29 12:40 EDT by Lee Yarwood
Modified: 2016-11-08 03:37 EST

Fixed In Version: pacemaker-1.1.13-8.el7
Doc Type: Bug Fix
Last Closed: 2016-11-08 03:37:43 EST
Type: Bug


Attachments
crm_resource -C -r rabbitmq-clone -VVVVV output (66.91 KB, text/plain)
2015-10-07 10:32 EDT, Asaf Hirshberg
using crm_resource -C -r rabbitmq-clone -VVVVV with pacemaker-libs-1.1.13-8.el7.x86_64 (16.93 KB, text/plain)
2015-10-08 02:53 EDT, Asaf Hirshberg

Description Lee Yarwood 2015-07-29 12:40:04 EDT
Description of problem:
The rabbitmq-server resource does not launch on a node after previous failures. Those failures were caused by the rabbitmq-server systemd service being left enabled when the host was rebooted.

Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.2.x86_64

How reproducible:
Unclear.

Steps to Reproduce:
Unclear.

Actual results:
It appears that pacemaker does not even ask the resource to start on the node in question. The following is seen in pacemaker.log even after `pcs resource cleanup abbitmq-server-clone` is called:

Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
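
For reference, the per-node fail count behind these warnings can be inspected directly. A minimal check, using only commands that appear later in this bug (the resource name rabbitmq-server is assumed):

    # Show pacemaker's per-node fail counts for the resource
    pcs resource failcount show rabbitmq-server
    # Or look for the raw transient attribute in the live CIB
    cibadmin -Q | grep fail-count-rabbitmq-server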

Expected results:
The resource is cleaned up and able to start on a node after previous failures.

Additional info:
Comment 3 Andrew Beekhof 2015-07-29 16:09:09 EDT
Ban + clear won't achieve anything here.
Comment 4 Andrew Beekhof 2015-07-29 16:36:01 EDT
So it's not being started due to:

          <nvpair id="status-3-last-failure-rabbitmq-server" name="last-failure-rabbitmq-server" value="1438107693"/>
          <nvpair id="status-3-fail-count-rabbitmq-server" name="fail-count-rabbitmq-server" value="INFINITY"/>

Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is this the actual command that was run?
Comment 5 Lee Yarwood 2015-07-29 17:48:42 EDT
(In reply to Andrew Beekhof from comment #4)
> So it's not being started due to:
> 
>           <nvpair id="status-3-last-failure-rabbitmq-server"
> name="last-failure-rabbitmq-server" value="1438107693"/>
>           <nvpair id="status-3-fail-count-rabbitmq-server"
> name="fail-count-rabbitmq-server" value="INFINITY"/>
> 
> Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is
> this the actual command that was run?

Just a typo, apologies.
Comment 6 Andrew Beekhof 2015-07-29 22:32:58 EDT
Could you try:

   crm_resource -C -r rabbitmq-server

and/or 

   `pcs resource cleanup rabbitmq-server`

then report the result of:

    cibadmin -Q | grep fail-count-rabbitmq-server
Comment 7 Andrew Beekhof 2015-07-29 22:36:29 EDT
Some background... there was a time when pacemaker didn't automatically clean up fail-counts - but 7.1 should have that code.

Also, some versions of pcs and crm_resource didn't behave consistently depending on whether the supplied resource name did or did not include the "-clone" suffix.
Comment 8 Lee Yarwood 2015-07-30 05:00:23 EDT
`crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and allowed it to run again on the host. `cibadmin -Q | grep fail-count-rabbitmq-server` then reported no output.

I forgot to mention that we had already attempted to run `pcs resource cleanup rabbitmq-server` a number of times without success.

Andrew, I'll leave the next steps in the bug to you as I'm not sure if there is anything to resolve here other than documenting this corner case.
Comment 9 Perry Myers 2015-07-30 07:41:37 EDT
@beekhof, this seems like a common case... someone puts a resource under pacemaker control but forgets to disable the respective systemd service.

Is there anything we can do proactively to prevent this error from occurring? For example, could Pacemaker periodically check for systemd units that are enabled but should not be, and either disable them or very vocally warn the user that these units should not be enabled, since the service is under pacemaker control?

I think the workaround described above for when this does happen is fine, but I'd like to make sure that we also prevent it from happening in the first place.
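
In the meantime, a crude manual guard is possible. A minimal sketch, assuming the systemd unit shares the resource's name (rabbitmq-server here):

    # Warn if the unit is still enabled while the resource is under pacemaker control
    if systemctl is-enabled --quiet rabbitmq-server; then
        echo "WARNING: rabbitmq-server.service is enabled and will race pacemaker at boot;"
        echo "consider: systemctl disable rabbitmq-server"
    fi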
Comment 10 Andrew Beekhof 2015-07-30 19:07:06 EDT
Yes, there is a plan involving the systemd provider API which looks relevant.

It is however a non-trivial task without much precedent and details on the API are sketchy, so it will likely take someone hiding in a cave for a few weeks to get it done.

Which is why it hasn't happened yet.
Comment 11 Andrew Beekhof 2015-07-30 19:53:16 EDT
(In reply to Lee Yarwood from comment #8)
> `crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and
> allowed it to run again on the host. `cibadmin -Q | grep
> fail-count-rabbitmq-server` then reported no output.
> 
> I forgot to list that we had already attempted to run `pcs resource cleanup
> rabbitmq-server` a number of times without success.

That is very weird, because looking at the pcs sources,

    pcs resource cleanup rabbitmq-server

executes:

    crm_resource -C -r rabbitmq-server

> Andrew, I'll leave the next steps in the bug to you as I'm not sure if there
> is anything to resolve here other than documenting this corner case.

Next step here is to check that `crm_resource -C -r XYZ` and `crm_resource -C -r XYZ-clone` both correctly clear the relevant failcount.
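
Roughly the following, per node; the grep should produce no output after each cleanup (resource name assumed):

    crm_resource -C -r rabbitmq-server
    cibadmin -Q | grep fail-count-rabbitmq-server        # expect no output
    crm_resource -C -r rabbitmq-server-clone
    cibadmin -Q | grep fail-count-rabbitmq-server        # expect no output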

After that, I may bump this over to the pcs guys to follow up on their side.
Comment 12 Andrew Beekhof 2015-08-02 21:49:35 EDT
This patch should do it:

https://github.com/beekhof/pacemaker/commit/181272b
Comment 15 Asaf Hirshberg 2015-10-07 10:32 EDT
Created attachment 1080695 [details]
crm_resource -C -r rabbitmq-clone -VVVVV output
Comment 16 Asaf Hirshberg 2015-10-07 10:37:54 EDT
Verified:
Using: 7.0-RHEL-7-director/2015-10-01.1
       resource-agents-3.9.5-52.el7.x86_64 # wasn't included in the puddle #
       pacemaker-1.1.12-22.el7_1.4.x86_64
       rabbitmq-server-3.3.5-5.el7ost.noarch


[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-1: 2
[root@overcloud-controller-1 ~]# crm_resource -C -r rabbitmq-clone 
Cleaning up rabbitmq:0 on overcloud-controller-0
Cleaning up rabbitmq:0 on overcloud-controller-1
Cleaning up rabbitmq:0 on overcloud-controller-2
Cleaning up rabbitmq:1 on overcloud-controller-0
Cleaning up rabbitmq:1 on overcloud-controller-1
Cleaning up rabbitmq:1 on overcloud-controller-2
Cleaning up rabbitmq:2 on overcloud-controller-0
Cleaning up rabbitmq:2 on overcloud-controller-1
Cleaning up rabbitmq:2 on overcloud-controller-2
Waiting for 9 replies from the CRMd......... OK
[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
No failcounts for rabbitmq

Full output of "crm_resource -C -r rabbitmq-clone -VVVVV" is attached above.

Steps: 1) kill -9 <rabbitmq PID>
       2) pcs resource failcount show rabbitmq
       3) crm_resource -C -r rabbitmq-clone
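
The same steps as a consolidated sketch (the pgrep-based PID lookup is an assumption, not part of the original steps):

    # 1) Induce a failure by killing the rabbitmq process
    kill -9 "$(pgrep -f rabbitmq-server | head -n1)"
    # 2) Confirm the failure was recorded
    pcs resource failcount show rabbitmq
    # 3) Clean up via the clone name and confirm the count is cleared
    crm_resource -C -r rabbitmq-clone
    pcs resource failcount show rabbitmq    # expect "No failcounts for rabbitmq"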
Comment 17 Asaf Hirshberg 2015-10-08 02:52:04 EDT
Fix: I forgot to update to pacemaker-1.1.13-8.el7.x86_64.

Verified, new results:

[root@overcloud-controller-0 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-0: 2
[root@overcloud-controller-0 ~]# crm_resource -C -r rabbitmq-clone
Cleaning up rabbitmq:0 on overcloud-controller-0, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-1, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-2, removing fail-count-rabbitmq
Waiting for 3 replies from the CRMd... OK
[root@overcloud-controller-0 ~]# 

Full output of "crm_resource -C -r rabbitmq-clone -VVVVV" is attached above.
Comment 18 Asaf Hirshberg 2015-10-08 02:53 EDT
Created attachment 1080863 [details]
using crm_resource -C -r rabbitmq-clone -VVVVV with pacemaker-libs-1.1.13-8.el7.x86_64
