Bug 1248132
| Field | Value |
|---|---|
| Summary | rabbitmq-server resource not launching on node after failures. |
| Product | Red Hat Enterprise Linux 7 |
| Component | pacemaker |
| Version | 7.1 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | unspecified |
| Reporter | Lee Yarwood <lyarwood> |
| Assignee | Andrew Beekhof <abeekhof> |
| QA Contact | Asaf Hirshberg <ahirshbe> |
| CC | abeekhof, cluster-maint, djansa, fdinitto, jruemker, oblaut, sputhenp |
| Target Milestone | rc |
| Target Release | --- |
| Fixed In Version | pacemaker-1.1.13-8.el7 |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2016-11-08 08:37:43 UTC |
Description
Lee Yarwood
2015-07-29 16:40:04 UTC
Comment 4, Andrew Beekhof:

Ban + clear won't achieve anything here.

So it's not being started due to:

```
<nvpair id="status-3-last-failure-rabbitmq-server" name="last-failure-rabbitmq-server" value="1438107693"/>
<nvpair id="status-3-fail-count-rabbitmq-server" name="fail-count-rabbitmq-server" value="INFINITY"/>
```

Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is this the actual command that was run?

Lee Yarwood:

(In reply to Andrew Beekhof from comment #4)
> Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is
> this the actual command that was run?

Just a typo, apologies.

Andrew Beekhof:

Could you try:

```
crm_resource -C -r rabbitmq-server
```

and/or `pcs resource cleanup rabbitmq-server`, then report the result of:

```
cibadmin -Q | grep fail-count-rabbitmq-server
```

Some background... there was a time when pacemaker didn't automatically clean up fail-counts, but 7.1 should have that code. Also, some versions of pcs and crm_resource didn't behave consistently depending on whether the supplied resource name did or did not include the "-clone" suffix.

Lee Yarwood (comment 8):

`crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and allowed it to run again on the host. `cibadmin -Q | grep fail-count-rabbitmq-server` then reported no output.

I forgot to list that we had already attempted to run `pcs resource cleanup rabbitmq-server` a number of times without success.

Andrew, I'll leave the next steps in the bug to you, as I'm not sure there is anything to resolve here other than documenting this corner case.

Comment:

@beekhof, this seems like a common case... someone puts a resource under pacemaker control but forgets to disable the respective systemd service. Is there anything we can do proactively to prevent this error from occurring? I.e., pacemaker somehow periodically checks whether there are systemd units enabled that should not be, and either disables them or very vocally warns the user that they should not have these systemd units enabled, since the service is under pacemaker control? I think the workaround described above for dealing with this when/if it does happen is fine, but I'd like to make sure that we also prevent it from happening in the first place.

Andrew Beekhof:

Yes, there is a plan involving the systemd provider API which looks relevant. It is, however, a non-trivial task without much precedent, and details on the API are sketchy, so it will likely take someone hiding in a cave for a few weeks to get it done. Hence why it hasn't happened yet.

(In reply to Lee Yarwood from comment #8)
> `crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and
> allowed it to run again on the host. `cibadmin -Q | grep
> fail-count-rabbitmq-server` then reported no output.
>
> I forgot to list that we had already attempted to run `pcs resource cleanup
> rabbitmq-server` a number of times without success.

That is very weird, because looking at the pcs sources, `pcs resource cleanup rabbitmq-server` executes:

```
crm_resource -C -r rabbitmq-server
```

Next step here is to check that `crm_resource -C -r XYZ` and `crm_resource -C -r XYZ-clone` both correctly clear the relevant failcount. After that, I may bump it to the pcs guys to follow up on their side.

This patch should do it: https://github.com/beekhof/pacemaker/commit/181272b

Created attachment 1080695 [details]
crm_resource -C -r rabbitmq-clone -VVVVV output
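The `cibadmin -Q | grep fail-count-...` check above is just looking for `fail-count-*` nvpairs in the CIB's status section. As a rough illustration, the lookup can be sketched in Python against a sample fragment shaped like the nvpairs quoted in this bug (the surrounding element names are an assumption for the example, not copied from a real CIB dump):

```python
# Hypothetical sketch: find fail-count attributes in a CIB status
# fragment, the way `cibadmin -Q | grep fail-count-...` does.
# The wrapper elements here are assumed for illustration; the nvpair
# lines mirror the ones quoted in comment #4 of this bug.
import xml.etree.ElementTree as ET

SAMPLE_STATUS = """
<status>
  <node_state id="3" uname="overcloud-controller-1">
    <transient_attributes id="3">
      <instance_attributes id="status-3">
        <nvpair id="status-3-last-failure-rabbitmq-server"
                name="last-failure-rabbitmq-server" value="1438107693"/>
        <nvpair id="status-3-fail-count-rabbitmq-server"
                name="fail-count-rabbitmq-server" value="INFINITY"/>
      </instance_attributes>
    </transient_attributes>
  </node_state>
</status>
"""

def fail_counts(status_xml):
    """Return {attribute-name: value} for every fail-count nvpair."""
    root = ET.fromstring(status_xml)
    return {
        nv.get("name"): nv.get("value")
        for nv in root.iter("nvpair")
        if nv.get("name", "").startswith("fail-count-")
    }

# A fail-count of INFINITY is what keeps the resource from being
# started until a cleanup removes the attribute.
print(fail_counts(SAMPLE_STATUS))
```

After a successful `crm_resource -C -r rabbitmq-server`, the equivalent grep returns nothing, which matches the empty result reported in comment 8.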
Verified.

Using:

```
7.0-RHEL-7-director/2015-10-01.1
resource-agents-3.9.5-52.el7.x86_64
# wasn't included in the puddle #
pacemaker-1.1.12-22.el7_1.4.x86_64
rabbitmq-server-3.3.5-5.el7ost.noarch
```

```
[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-1: 2
[root@overcloud-controller-1 ~]# crm_resource -C -r rabbitmq-clone
Cleaning up rabbitmq:0 on overcloud-controller-0
Cleaning up rabbitmq:0 on overcloud-controller-1
Cleaning up rabbitmq:0 on overcloud-controller-2
Cleaning up rabbitmq:1 on overcloud-controller-0
Cleaning up rabbitmq:1 on overcloud-controller-1
Cleaning up rabbitmq:1 on overcloud-controller-2
Cleaning up rabbitmq:2 on overcloud-controller-0
Cleaning up rabbitmq:2 on overcloud-controller-1
Cleaning up rabbitmq:2 on overcloud-controller-2
Waiting for 9 replies from the CRMd......... OK
[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
No failcounts for rabbitmq
```

Full output of `crm_resource -C -r rabbitmq-clone -VVVVV` is attached above.

Steps:

1) kill -9 <rabbitmq PID>
2) pcs resource failcount show rabbitmq
3) crm_resource -C -r rabbitmq-clone

Fix: I forgot to update to pacemaker-1.1.13-8.el7.x86_64. Verified again; new results:

```
[root@overcloud-controller-0 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-0: 2
[root@overcloud-controller-0 ~]# crm_resource -C -r rabbitmq-clone
Cleaning up rabbitmq:0 on overcloud-controller-0, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-1, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-2, removing fail-count-rabbitmq
Waiting for 3 replies from the CRMd... OK
[root@overcloud-controller-0 ~]#
```

Full output of `crm_resource -C -r rabbitmq-clone -VVVVV` is attached above.

Created attachment 1080863 [details]
using crm_resource -C -r rabbitmq-clone -VVVVV with pacemaker-libs-1.1.13-8.el7.x86_64
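The verification above exercises the inconsistency Andrew flagged: `crm_resource -C -r XYZ` and `crm_resource -C -r XYZ-clone` should both clear the same failcount. A tolerant tool can get there by normalizing both spellings to one primitive name before cleanup. The helper below is an illustrative sketch of that idea only, not pacemaker's actual logic:

```python
# Illustrative sketch: map both "rabbitmq" and "rabbitmq-clone" to the
# same primitive resource name, so a cleanup keyed on either spelling
# clears the same fail-count attribute. Hypothetical helper, not code
# from pacemaker or pcs.
def primitive_name(resource_id: str) -> str:
    """Strip a trailing '-clone' so both spellings name one resource."""
    suffix = "-clone"
    if resource_id.endswith(suffix):
        return resource_id[: -len(suffix)]
    return resource_id

# Both forms used in this bug resolve to the same primitive:
print(primitive_name("rabbitmq-clone"))   # -> rabbitmq
print(primitive_name("rabbitmq"))         # -> rabbitmq
```

With that normalization, a fail-count lookup or cleanup keyed on `fail-count-<primitive>` behaves the same regardless of which form the user typed, which is the behavior the linked patch (https://github.com/beekhof/pacemaker/commit/181272b) was meant to guarantee on the pacemaker side.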