Description of problem:
The current fencing configuration only shuts down the fenced host and does not bring it back up. The fenced host stays down until it is started manually.

Scenario:
I have an HA deployment with properly defined fencing (confirmed by Andrew Beekhof). I execute an action (a heartbeat interface restart) that must trigger STONITH and fence the host. I can see that the host is being fenced: it shuts down, but it never comes back up and stays down until I start it manually. I am using RHEL 7.1 with GA, and RHEL 7.0 had no problems.

Here is the log showing that the host was fenced but the power ON failed because of a timeout.
-------------------------------------------------------------------------------
Feb 19 14:03:09 mac848f69fbc643 stonith-ng[15986]: notice: can_fence_host_with_device: stonith-ipmilan-10.35.160.168 can not fence (reboot) pcmk-mac848f69fbc4c3: static-list
Feb 19 14:03:09 mac848f69fbc643 stonith-ng[15986]: notice: can_fence_host_with_device: stonith-ipmilan-10.35.160.170 can fence (reboot) pcmk-mac848f69fbc4c3: static-list
Feb 19 14:03:09 mac848f69fbc643 stonith-ng[15986]: notice: can_fence_host_with_device: stonith-ipmilan-10.35.160.168 can not fence (reboot) pcmk-mac848f69fbc4c3: static-list
Feb 19 14:03:09 mac848f69fbc643 stonith-ng[15986]: notice: can_fence_host_with_device: stonith-ipmilan-10.35.160.170 can fence (reboot) pcmk-mac848f69fbc4c3: static-list
Feb 19 14:03:37 mac848f69fbc643 fence_ipmilan: Timed out waiting to power ON
-----------------------------------------------------------------------------
Here is the output of pcs cluster cib:
http://fpaste.org/187277/24298309/
------------------------------------------------------------------------------

We can change the fencing command to add --retry-on=3 (try up to 3 times to power the host on) and possibly extend the timeout with --power-timeout=60. But first we need to understand whether this is a good default and, if it is, why it is not already assumed, and whether there is a case where we would want the user to be able to change it.
Just to make it clear: when I was using GA with RHEL 7 there were no problems with the fencing. Now I am using A1 with RHEL 7.1.
Currently (and since at least April 2014), the command we run to create the ipmilan stonith device is: /usr/sbin/pcs stonith create stonith-ipmilan-${real_address} fence_ipmilan ${pcmk_host_list_chunk} ipaddr=${real_address} ${username_chunk} ${password_chunk} ${lanplus_chunk} op monitor interval=${interval}
Andrew, can you provide any insight on this? Was there a change between RHEL 7.0 and 7.1 that could cause what Leonid is seeing? Is the setting he mentions reasonable? Is it something we want the user to be able to tweak, even if it is reasonable as a default?
Jason: No change, just some devices are evidently slower. In general there is no penalty for having a longer timeout, we'll continue as soon as the fencing completes. One could even argue that the agent itself should be a little more generous.
This bug happened on another setup here in QA. Fencing only shuts down the cluster node, and it stays down. RHEL 7.1. Jason, do you know when we are going to fix it?
(In reply to Leonid Natapov from comment #9)
> This bug happen on another setup here in QA. Fencing only shuts down cluster
> node and it stays down. RHWL 7.1. Jason ,do you know when we are going to
> fix it ?

There is currently no target release, and I do not know what the recommendation is for a proper setting from the pacemaker team. Andrew, could you clarify your previous response so I know what to fix? If this is to make any release, my guess would be A4.
'fence_ipmilan -o metadata' shows the following option:

    <parameter name="power_timeout" unique="0" required="0">
        <getopt mixed="--power-timeout=[seconds]" />
        <content type="string" default="20" />
        <shortdesc lang="en">Test X seconds for status change after ON/OFF</shortdesc>
    </parameter>

and

    <parameter name="retry_on" unique="0" required="0">
        <getopt mixed="--retry-on=[attempts]" />
        <content type="string" default="1" />
        <shortdesc lang="en">Count of attempts to retry power on</shortdesc>
    </parameter>

which correspond to the options mentioned in the description. I'd suggest adding the following parameters to the command for creating the fencing device in pcs:

    power_timeout=60 retry_on=3
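As a rough illustration of that suggestion, the sketch below only prints a pcs command with the two parameters appended; it does not run pcs. The address, node name and credentials are placeholders, and the parameter names login, passwd and lanplus are assumptions about what the ${username_chunk}, ${password_chunk} and ${lanplus_chunk} variables expand to in the installer.

```sh
#!/bin/sh
# Sketch: print (do not run) a pcs command that creates a fence_ipmilan
# device with the longer power-on timeout and retry count suggested above.
# All argument values are placeholders, not taken from the reporter's cluster.

build_stonith_cmd() {
    address="$1"     # IPMI/BMC address of the node to fence
    host_list="$2"   # pacemaker node name(s) this device may fence
    login="$3"       # assumed parameter name for the IPMI user
    passwd="$4"      # assumed parameter name for the IPMI password
    interval="$5"    # monitor interval, e.g. 60s

    # printf repeats the '%s ' format for every argument, so the command
    # is printed as one space-separated line.
    printf '%s ' \
        pcs stonith create "stonith-ipmilan-${address}" fence_ipmilan \
        "pcmk_host_list=${host_list}" \
        "ipaddr=${address}" \
        "login=${login}" "passwd=${passwd}" \
        "lanplus=true" \
        "power_timeout=60" "retry_on=3" \
        op monitor "interval=${interval}"
    printf '\n'
}

build_stonith_cmd 192.0.2.10 node1 admin secret 60s
```

Printing the command first makes it easy to review the final parameter list before applying it to a live cluster.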
Reproduces on the latest rhel-osp-director puddle.
Moving the bug to ospd, since the problem is still there.
Hi Leonid,

can you clarify a bit what you mean by reproduced in director? (I ask because osp-d does not configure fencing out of the box.)

Can you share the following:
- The CIB (pcs cluster cib) of this OSP-d cluster
- How you configured fencing and how you are trying to fence the node
- /var/log/pacemaker.log from all three controllers

thanks,
Michele
I configured it manually according to this guide: https://docs.google.com/document/d/10FPwRba6aJ4PzXLw7FKR77mV5PwKpSOiEUwGAyc1o18/edit
So I tried to reproduce this to no avail (i.e. fencing does a correct reboot). This might simply be due to your baremetal needing a bit of tuning because IPMI is slow to respond. Have you tried the suggestion Andrew gave in comment 12? If you still have this environment around, can you ping me online (bandini) and I'll take a look?
(In reply to Michele Baldessari from comment #18)
> So I tried to reproduce this to no avail (aka fencing does a correct reboot).
>
> This might simply due to your baremetal needing a bit of tuning due to IPMI
> being slow to respond. Have you tried the suggestion Andrew gave in comment
> 12?
>
> If you still have this environment around, can you ping me online (bandini)
> and I'll take a look?

Sure, will ping you. You can also find me on IRC (Lesik).
Would updating ipmitool to the same version as BZ 1269523 help fix this issue?
(In reply to David Hill from comment #23)
> Would updating ipmitool to the same version as BZ 1269523 help fix this
> issue?

I have tested with ipmitool-1.8.15-7.el7.x86_64 and the problem still exists.
Has the suggestion in comment 12 been tried? What was the outcome?
Can we get some feedback on the suggestion in comment #12 please?
(In reply to Andrew Beekhof from comment #27)
> Can we get some feedback on the suggestion in comment #12 please?

Hey Andrew. Sorry for the late response. It has been tried and it solved the problem.
Moving to OSP11; this depends on the auto-fencing configuration. This bug serves mostly as input for the auto-fencing configuration. Chris, we need to make sure that we use sane timeout defaults when configuring devices, or allow overrides as we go.
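One way to picture "sane defaults with overrides" is the sketch below. It is only an illustration, not how the auto-fencing configuration is implemented: POWER_TIMEOUT and RETRY_ON are hypothetical environment variable names, and the fallback values are the ones reported in this bug as fixing the problem.

```sh
#!/bin/sh
# Sketch: fence_ipmilan timing options with defaults a deployer can override.
# POWER_TIMEOUT and RETRY_ON are hypothetical variable names.

fence_timing_opts() {
    # ${VAR:-default} uses the default when VAR is unset or empty.
    printf 'power_timeout=%s retry_on=%s\n' \
        "${POWER_TIMEOUT:-60}" "${RETRY_ON:-3}"
}

# With nothing set in the environment, the defaults are used.
fence_timing_opts

# Override for hardware whose IPMI controller is slower to respond.
# The subshell keeps the override from leaking into later calls.
( POWER_TIMEOUT=120; fence_timing_opts )
```

The resulting string could be appended to the device creation command, so that slow hardware can be accommodated without changing the default for everyone.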
*** This bug has been marked as a duplicate of bug 1242422 ***