Bug 1248132 - rabbitmq-server resource not launching on node after failures.
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Assigned To: Andrew Beekhof
QA Contact: Asaf Hirshberg
Reported: 2015-07-29 12:40 EDT by Lee Yarwood
Modified: 2016-11-08 03:37 EST

Fixed In Version: pacemaker-1.1.13-8.el7
Doc Type: Bug Fix
Last Closed: 2016-11-08 03:37:43 EST
Type: Bug


Attachments
crm_resource -C -r rabbitmq-clone -VVVVV output (66.91 KB, text/plain)
2015-10-07 10:32 EDT, Asaf Hirshberg
using crm_resource -C -r rabbitmq-clone -VVVVV with pacemaker-libs-1.1.13-8.el7.x86_64 (16.93 KB, text/plain)
2015-10-08 02:53 EDT, Asaf Hirshberg

Description Lee Yarwood 2015-07-29 12:40:04 EDT
Description of problem:
The rabbitmq-server resource does not launch on a node after previous failures. Those failures were caused by the rabbitmq-server systemd service being left enabled when the host was rebooted.

Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.2.x86_64

How reproducible:
Unclear.

Steps to Reproduce:
Unclear.

Actual results:
It appears that pacemaker does not even ask the resource to start on the node in question. The following is seen in pacemaker.log even after `pcs resource cleanup abbitmq-server-clone` is called:

Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
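
For reference, the per-node fail count behind these warnings can be inspected directly. A minimal check, using only commands that appear later in this bug (the resource name rabbitmq-server is assumed):

    # Show pacemaker's per-node fail counts for the resource
    pcs resource failcount show rabbitmq-server
    # Or look for the raw transient attribute in the live CIB
    cibadmin -Q | grep fail-count-rabbitmq-server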

Expected results:
The resource is cleaned up and able to start on a node after previous failures.

Additional info:
Comment 3 Andrew Beekhof 2015-07-29 16:09:09 EDT
Ban + clear won't achieve anything here.
Comment 4 Andrew Beekhof 2015-07-29 16:36:01 EDT
So it's not being started due to:

          <nvpair id="status-3-last-failure-rabbitmq-server" name="last-failure-rabbitmq-server" value="1438107693"/>
          <nvpair id="status-3-fail-count-rabbitmq-server" name="fail-count-rabbitmq-server" value="INFINITY"/>

Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is this the actual command that was run?
Comment 5 Lee Yarwood 2015-07-29 17:48:42 EDT
(In reply to Andrew Beekhof from comment #4)
> So it's not being started due to:
> 
>           <nvpair id="status-3-last-failure-rabbitmq-server"
> name="last-failure-rabbitmq-server" value="1438107693"/>
>           <nvpair id="status-3-fail-count-rabbitmq-server"
> name="fail-count-rabbitmq-server" value="INFINITY"/>
> 
> Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is
> this the actual command that was run?

Just a typo, apologies.
Comment 6 Andrew Beekhof 2015-07-29 22:32:58 EDT
Could you try:

   crm_resource -C -r rabbitmq-server

and/or 

   `pcs resource cleanup rabbitmq-server`

then report the result of:

    cibadmin -Q | grep fail-count-rabbitmq-server
Comment 7 Andrew Beekhof 2015-07-29 22:36:29 EDT
Some background... there was a time when pacemaker didn't automatically clean up fail-counts - but 7.1 should have that code.

Also, some versions of pcs and crm_resource didn't behave consistently depending on whether the supplied resource name did or did not include the "-clone" suffix.
Comment 8 Lee Yarwood 2015-07-30 05:00:23 EDT
`crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and allowed it to run again on the host. `cibadmin -Q | grep fail-count-rabbitmq-server` then reported no output.

I forgot to mention that we had already attempted to run `pcs resource cleanup rabbitmq-server` a number of times without success.

Andrew, I'll leave the next steps in the bug to you as I'm not sure if there is anything to resolve here other than documenting this corner case.
Comment 9 Perry Myers 2015-07-30 07:41:37 EDT
@beekhof, this seems like a common case... someone puts a resource under pacemaker control but forgets to disable the respective systemd service.

Is there anything we can do proactively to prevent this error from occurring? For example, could Pacemaker periodically check for systemd units that are enabled but should not be, and either disable them or very vocally warn the user that these units should not be enabled, since the service is under pacemaker control?

I think the workaround described above for when this does happen is fine, but I'd like to make sure that we also prevent it from happening in the first place.
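
In the meantime, a crude manual guard is possible. A minimal sketch, assuming the systemd unit shares the resource's name (rabbitmq-server here):

    # Warn if the unit is still enabled while the resource is under pacemaker control
    if systemctl is-enabled --quiet rabbitmq-server; then
        echo "WARNING: rabbitmq-server.service is enabled and will race pacemaker at boot;"
        echo "consider: systemctl disable rabbitmq-server"
    fi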
Comment 10 Andrew Beekhof 2015-07-30 19:07:06 EDT
Yes, there is a plan involving the systemd provider API which looks relevant.

It is however a non-trivial task without much precedent and details on the API are sketchy, so it will likely take someone hiding in a cave for a few weeks to get it done.

Which is why it hasn't happened yet.
Comment 11 Andrew Beekhof 2015-07-30 19:53:16 EDT
(In reply to Lee Yarwood from comment #8)
> `crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and
> allowed it to run again on the host. `cibadmin -Q | grep
> fail-count-rabbitmq-server` then reported no output.
> 
> I forgot to list that we had already attempted to run `pcs resource cleanup
> rabbitmq-server` a number of times without success.

That is very weird, because looking at the pcs sources,

    pcs resource cleanup rabbitmq-server

executes:

    crm_resource -C -r rabbitmq-server

> Andrew, I'll leave the next steps in the bug to you as I'm not sure if there
> is anything to resolve here other than documenting this corner case.

Next step here is to check that `crm_resource -C -r XYZ` and `crm_resource -C -r XYZ-clone` both correctly clear the relevant failcount.
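
Roughly the following, per node; the grep should produce no output after each cleanup (resource name assumed):

    crm_resource -C -r rabbitmq-server
    cibadmin -Q | grep fail-count-rabbitmq-server        # expect no output
    crm_resource -C -r rabbitmq-server-clone
    cibadmin -Q | grep fail-count-rabbitmq-server        # expect no output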

After that, I may bump this over to the pcs guys to follow up on their side.
Comment 12 Andrew Beekhof 2015-08-02 21:49:35 EDT
This patch should do it:

https://github.com/beekhof/pacemaker/commit/181272b
Comment 15 Asaf Hirshberg 2015-10-07 10:32 EDT
Created attachment 1080695 [details]
crm_resource -C -r rabbitmq-clone -VVVVV output
Comment 16 Asaf Hirshberg 2015-10-07 10:37:54 EDT
Verified:
Using: 7.0-RHEL-7-director/2015-10-01.1
       resource-agents-3.9.5-52.el7.x86_64 # wasn't included in the puddle #
       pacemaker-1.1.12-22.el7_1.4.x86_64
       rabbitmq-server-3.3.5-5.el7ost.noarch


[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-1: 2
[root@overcloud-controller-1 ~]# crm_resource -C -r rabbitmq-clone 
Cleaning up rabbitmq:0 on overcloud-controller-0
Cleaning up rabbitmq:0 on overcloud-controller-1
Cleaning up rabbitmq:0 on overcloud-controller-2
Cleaning up rabbitmq:1 on overcloud-controller-0
Cleaning up rabbitmq:1 on overcloud-controller-1
Cleaning up rabbitmq:1 on overcloud-controller-2
Cleaning up rabbitmq:2 on overcloud-controller-0
Cleaning up rabbitmq:2 on overcloud-controller-1
Cleaning up rabbitmq:2 on overcloud-controller-2
Waiting for 9 replies from the CRMd......... OK
[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
No failcounts for rabbitmq

Full output of "crm_resource -C -r rabbitmq-clone -VVVVV" is attached above.

Steps: 1) kill -9 <rabbitmq PID>
       2) pcs resource failcount show rabbitmq
       3) crm_resource -C -r rabbitmq-clone
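
The same steps as a consolidated sketch (the pgrep-based PID lookup is an assumption, not part of the original steps):

    # 1) Induce a failure by killing the rabbitmq process
    kill -9 "$(pgrep -f rabbitmq-server | head -n1)"
    # 2) Confirm the failure was recorded
    pcs resource failcount show rabbitmq
    # 3) Clean up via the clone name and confirm the count is cleared
    crm_resource -C -r rabbitmq-clone
    pcs resource failcount show rabbitmq    # expect "No failcounts for rabbitmq"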
Comment 17 Asaf Hirshberg 2015-10-08 02:52:04 EDT
Fix: I forgot to update to pacemaker-1.1.13-8.el7.x86_64.

Verified, new results:

[root@overcloud-controller-0 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-0: 2
[root@overcloud-controller-0 ~]# crm_resource -C -r rabbitmq-clone
Cleaning up rabbitmq:0 on overcloud-controller-0, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-1, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-2, removing fail-count-rabbitmq
Waiting for 3 replies from the CRMd... OK
[root@overcloud-controller-0 ~]# 

Full output of "crm_resource -C -r rabbitmq-clone -VVVVV" is attached above.
Comment 18 Asaf Hirshberg 2015-10-08 02:53 EDT
Created attachment 1080863 [details]
using crm_resource -C -r rabbitmq-clone -VVVVV with pacemaker-libs-1.1.13-8.el7.x86_64
