Description of problem:

rabbitmq-server resource not launching on a node after previous failures. These previous failures were caused by the rabbitmq-server systemd service being enabled while the host was rebooted.

Version-Release number of selected component (if applicable):

pacemaker-1.1.12-22.el7_1.2.x86_64

How reproducible:

Unclear.

Steps to Reproduce:

Unclear.

Actual results:

It appears that the resource is not even asked to start on the node in question by pacemaker. The following is seen in pacemaker.log even after `pcs resource cleanup abbitmq-server-clone` is called:

Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)
Jul 29 15:18:01 mac1234 pengine[2869]: warning: common_apply_stickiness: Forcing rabbitmq-server-clone away from pcmk-mac1234 after 1000000 failures (max=1000000)

Expected results:

The resource is cleaned up and able to start on a node after previous failures.

Additional info:
Ban + clear won't achieve anything here.
So it's not being started due to:

<nvpair id="status-3-last-failure-rabbitmq-server" name="last-failure-rabbitmq-server" value="1438107693"/>
<nvpair id="status-3-fail-count-rabbitmq-server" name="fail-count-rabbitmq-server" value="INFINITY"/>

Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is this the actual command that was run?
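As a sketch only: this is how the blocking attributes can be spotted in a CIB dump. The two nvpair lines are the ones quoted in this comment; on a live cluster they would come from a full `cibadmin -Q` instead of a hard-coded snippet.

```shell
# Sample data copied from this bug's status section (not a live query):
cib_snippet='<nvpair id="status-3-last-failure-rabbitmq-server" name="last-failure-rabbitmq-server" value="1438107693"/>
<nvpair id="status-3-fail-count-rabbitmq-server" name="fail-count-rabbitmq-server" value="INFINITY"/>'

# An INFINITY fail-count is what makes pengine force the clone away from
# the node until the attribute is cleared.
echo "$cib_snippet" | grep -o 'name="fail-count-[^"]*" value="[^"]*"'
```

An empty result from the same grep after cleanup is what indicates the fail-count was actually removed.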
(In reply to Andrew Beekhof from comment #4)
> So its not being started due to:
>
> <nvpair id="status-3-last-failure-rabbitmq-server"
> name="last-failure-rabbitmq-server" value="1438107693"/>
> <nvpair id="status-3-fail-count-rabbitmq-server"
> name="fail-count-rabbitmq-server" value="INFINITY"/>
>
> Is `pcs resource cleanup abbitmq-server-clone` a typo only in the BZ, or is
> this the actual command that was run?

Just a typo, apologies.
Could you try:

    crm_resource -C -r rabbitmq-server

and/or `pcs resource cleanup rabbitmq-server`, then report the result of:

    cibadmin -Q | grep fail-count-rabbitmq-server
Some background... there was a time when pacemaker didn't automatically clean up fail-counts, but 7.1 should have that code. Also, some versions of pcs and crm_resource behaved inconsistently depending on whether the supplied resource name included the "-clone" suffix.
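The "-clone" suffix inconsistency comes down to whether the tool normalizes the name before acting on it, since fail-count attributes are keyed on the primitive's name. A minimal sketch of that normalization (hypothetical helper, not the actual pcs/pacemaker code):

```shell
# Hypothetical helper: map a clone resource name to the primitive name
# that fail-count attributes are actually keyed on.
strip_clone_suffix() {
    case "$1" in
        *-clone) printf '%s\n' "${1%-clone}" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

# Both forms should end up targeting fail-count-rabbitmq-server:
strip_clone_suffix rabbitmq-server-clone   # -> rabbitmq-server
strip_clone_suffix rabbitmq-server         # -> rabbitmq-server
```

A tool that skips this step would try to clear a nonexistent fail-count-rabbitmq-server-clone attribute and silently accomplish nothing.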
`crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and allowed it to run again on the host. `cibadmin -Q | grep fail-count-rabbitmq-server` then reported no output.

I forgot to mention that we had already attempted to run `pcs resource cleanup rabbitmq-server` a number of times without success.

Andrew, I'll leave the next steps in the bug to you, as I'm not sure there is anything to resolve here other than documenting this corner case.
@beekhof, this seems like a common case... someone puts a resource under pacemaker control but forgets to disable the respective systemd service. Is there anything we can do proactively to prevent this error from occurring? i.e. pacemaker somehow periodically checks whether there are systemd units enabled that should not be, and either disables them or very vocally warns the user that they should not have these systemd units enabled, since the service is under pacemaker control?

I think the workaround described above for dealing with what to do when/if this does happen is fine, but I'd like to make sure that we also prevent it from happening in the first place.
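A rough sketch of the proactive check suggested here, assuming the list of pacemaker-managed service names is available (in reality it would be pulled from `pcs resource` or the CIB). The function name and the SYSTEMCTL override are hypothetical, not existing pacemaker tooling:

```shell
# Hypothetical audit: warn about pacemaker-managed resources whose
# systemd unit is still enabled, i.e. would also be started on boot.
# SYSTEMCTL can be overridden for testing; it defaults to systemctl.
warn_enabled_units() {
    while read -r unit; do
        [ -n "$unit" ] || continue
        # `systemctl is-enabled` prints "enabled" for enabled units.
        state=$("${SYSTEMCTL:-systemctl}" is-enabled "$unit" 2>/dev/null || true)
        if [ "$state" = "enabled" ]; then
            echo "WARNING: $unit is pacemaker-managed but its systemd unit is enabled"
        fi
    done
}

# Example with assumed resource names (not from a live cluster):
printf 'rabbitmq-server\nhaproxy\n' | warn_enabled_units
```

Something like this run from cron, or at cluster start, would at least surface the misconfiguration before a reboot triggers the dual-start failure described in this bug.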
Yes, there is a plan involving the systemd provider API, which looks relevant. It is, however, a non-trivial task without much precedent, and details on the API are sketchy, so it will likely take someone hiding in a cave for a few weeks to get it done. That's why it hasn't happened yet.
(In reply to Lee Yarwood from comment #8)
> `crm_resource -C -r rabbitmq-server` correctly cleaned the resource up and
> allowed it to run again on the host. `cibadmin Q | grep
> fail-count-rabbitmq-server` then reported no output.
>
> I forgot to list that we had already attempted to run `pcs resource cleanup
> rabbitmq-server` a number of times without success.

That is very weird, because looking at the pcs sources, `pcs resource cleanup rabbitmq-server` executes:

    crm_resource -C -r rabbitmq-server

> Andrew, I'll leave the next steps in the bug to you as I'm not sure if there
> is anything to resolve here other than documenting this corner case.

Next step here is to check that `crm_resource -C -r XYZ` and `crm_resource -C -r XYZ-clone` both correctly clear the relevant failcount. After that, I may bump it to the pcs guys to follow up on their side.
This patch should do it: https://github.com/beekhof/pacemaker/commit/181272b
Created attachment 1080695 [details] crm_resource -C -r rabbitmq-clone -VVVVV output
Verified.

Using: 7.0-RHEL-7-director/2015-10-01.1
resource-agents-3.9.5-52.el7.x86_64
# wasn't included in the puddle #
pacemaker-1.1.12-22.el7_1.4.x86_64
rabbitmq-server-3.3.5-5.el7ost.noarch

[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-1: 2

[root@overcloud-controller-1 ~]# crm_resource -C -r rabbitmq-clone
Cleaning up rabbitmq:0 on overcloud-controller-0
Cleaning up rabbitmq:0 on overcloud-controller-1
Cleaning up rabbitmq:0 on overcloud-controller-2
Cleaning up rabbitmq:1 on overcloud-controller-0
Cleaning up rabbitmq:1 on overcloud-controller-1
Cleaning up rabbitmq:1 on overcloud-controller-2
Cleaning up rabbitmq:2 on overcloud-controller-0
Cleaning up rabbitmq:2 on overcloud-controller-1
Cleaning up rabbitmq:2 on overcloud-controller-2
Waiting for 9 replies from the CRMd......... OK

[root@overcloud-controller-1 ~]# pcs resource failcount show rabbitmq
No failcounts for rabbitmq

Full output of "crm_resource -C -r rabbitmq-clone -VVVVV" is attached above.

Steps:
1) kill -9 <rabbitmq PID>
2) pcs resource failcount show rabbitmq
3) crm_resource -C -r rabbitmq-clone
Fix: I forgot to update pacemaker. Verified with pacemaker-1.1.13-8.el7.x86_64; new results:

[root@overcloud-controller-0 ~]# pcs resource failcount show rabbitmq
Failcounts for rabbitmq
 overcloud-controller-0: 2

[root@overcloud-controller-0 ~]# crm_resource -C -r rabbitmq-clone
Cleaning up rabbitmq:0 on overcloud-controller-0, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-1, removing fail-count-rabbitmq
Cleaning up rabbitmq:0 on overcloud-controller-2, removing fail-count-rabbitmq
Waiting for 3 replies from the CRMd... OK

[root@overcloud-controller-0 ~]#

Full output of "crm_resource -C -r rabbitmq-clone -VVVVV" is attached above.
Created attachment 1080863 [details] using crm_resource -C -r rabbitmq-clone -VVVVV with pacemaker-libs-1.1.13-8.el7.x86_64