Bug 1194301

Summary:	HA | Current fencing configuration (ipmilan default command) only shuts down the fenced host and does not bring it up.
Product:	Red Hat OpenStack	Reporter:	Leonid Natapov <lnatapov>
Component:	rhosp-director	Assignee:	Chris Jones <chjones>
Status:	CLOSED DUPLICATE	QA Contact:	Udi Shkalim <ushkalim>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.0 (Kilo)	CC:	abeekhof, dhill, fdinitto, jcoufal, jguiditt, lnatapov, mburns, michele, morazi, oblaut, rhel-osp-director-maint, rhos-maint, royoung, srevivo, tshefi, ushkalim
Target Milestone:	ga	Keywords:	FutureFeature, InstallerIntegration, UserExperience
Target Release:	11.0 (Ocata)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-01-11 04:47:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1242422
Bug Blocks:

Description Leonid Natapov 2015-02-19 14:10:33 UTC

Description of problem:
Current fencing configuration only shuts down the fenced host and does not bring it back up. The fenced host stays down until it is started manually.

Scenario:
I have an HA deployment with properly defined fencing (confirmed by Andrew Beekhof).
I execute an action (heartbeat interface restart) that must trigger STONITH and fence the host. I can see that the host is fenced: it shuts down but never comes back up, and it stays down until I start it manually.

I am using RHEL 7.1 with GA; RHEL 7.0 had no problems.

Here is the log showing that the host was fenced but POWER ON failed because of a timeout.
-------------------------------------------------------------------------------
Feb 19 14:03:09 mac848f69fbc643 stonith-ng[15986]: notice: can_fence_host_with_device: stonith-ipmilan-10.35.160.168 can not fence (reboot) pcmk-mac848f69fbc4c3: static-list
Feb 19 14:03:09 mac848f69fbc643 stonith-ng[15986]: notice: can_fence_host_with_device: stonith-ipmilan-10.35.160.170 can fence (reboot) pcmk-mac848f69fbc4c3: static-list
Feb 19 14:03:09 mac848f69fbc643 stonith-ng[15986]: notice: can_fence_host_with_device: stonith-ipmilan-10.35.160.168 can not fence (reboot) pcmk-mac848f69fbc4c3: static-list
Feb 19 14:03:09 mac848f69fbc643 stonith-ng[15986]: notice: can_fence_host_with_device: stonith-ipmilan-10.35.160.170 can fence (reboot) pcmk-mac848f69fbc4c3: static-list

Feb 19 14:03:37 mac848f69fbc643 fence_ipmilan: Timed out waiting to power ON

-----------------------------------------------------------------------------
Here is the output of pcs cluster cib:
http://fpaste.org/187277/24298309/
------------------------------------------------------------------------------

We can change the fencing command to add --retry-on=3 (try up to three times to power the host on) and possibly extend the timeout with --power-timeout=60. But first we need to understand whether this is a good default, why we do not already assume it if so, and whether there is a case where we would want the user to be able to change it.

Comment 5 Leonid Natapov 2015-02-19 14:30:33 UTC

Just to make it clear: when I was using GA with RHEL 7 there were no problems with fencing. Now I am using A1 with RHEL 7.1.

Comment 6 Jason Guiditta 2015-02-19 15:35:57 UTC

Currently (and since at least April 2014), the command we run to create the ipmilan stonith device is:

/usr/sbin/pcs stonith create stonith-ipmilan-${real_address} fence_ipmilan ${pcmk_host_list_chunk} ipaddr=${real_address} ${username_chunk} ${password_chunk} ${lanplus_chunk} op monitor interval=${interval}

Comment 7 Jason Guiditta 2015-03-11 17:05:36 UTC

Andrew, can you provide any insight on this? Was there a change between RHEL 7.0 and 7.1 that could cause what Leonid is seeing? Is the setting he mentions reasonable? Is it something we want the user to be able to tweak, even if it is reasonable as a default?

Comment 8 Andrew Beekhof 2015-03-11 18:58:11 UTC

Jason:  No change, just some devices are evidently slower.

In general there is no penalty for having a longer timeout: we continue as soon as the fencing completes.

One could even argue that the agent itself should be a little more generous.
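
The point about longer timeouts can be illustrated with a generic poll-with-timeout loop. This is only a sketch: wait_for_power_on and its arguments are hypothetical, and it is not how fence_ipmilan is actually implemented. The timeout is an upper bound, so the loop returns as soon as the status command reports "on".

```shell
#!/bin/sh
# Poll a status command until it reports "on" or the timeout expires.
# Hypothetical helper for illustration only; not fence_ipmilan's code.
wait_for_power_on() {
    timeout=$1       # upper bound in seconds
    status_cmd=$2    # command that prints "on" or "off"
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if [ "$($status_cmd)" = "on" ]; then
            # Success ends the wait early; the rest of the timeout is unused.
            echo "on after ${waited}s"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "timed out after ${timeout}s"
    return 1
}
```

With this shape, raising the timeout from 20 to 60 only changes how long a host that never answers is waited for; a host that powers on after a few seconds is still reported after a few seconds.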

Comment 9 Leonid Natapov 2015-04-26 09:37:28 UTC

This bug happened on another setup here in QA. Fencing only shuts down the cluster node and it stays down. RHEL 7.1. Jason, do you know when we are going to fix it?

Comment 11 Jason Guiditta 2015-04-28 12:54:40 UTC

(In reply to Leonid Natapov from comment #9)
> This bug happened on another setup here in QA. Fencing only shuts down the
> cluster node and it stays down. RHEL 7.1. Jason, do you know when we are
> going to fix it?

There is currently no target release, and I do not know what the recommendation is for a proper setting from the pacemaker team.  Andrew, could you clarify your previous response so I know what to fix?  If this is to make any release, my guess would be A4.

Comment 12 Andrew Beekhof 2015-05-12 01:07:54 UTC

'fence_ipmilan -o metadata' shows the following option:

	<parameter name="power_timeout" unique="0" required="0">
		<getopt mixed="--power-timeout=[seconds]" />
		<content type="string" default="20"  />
		<shortdesc lang="en">Test X seconds for status change after ON/OFF</shortdesc>
	</parameter>

and

	<parameter name="retry_on" unique="0" required="0">
		<getopt mixed="--retry-on=[attempts]" />
		<content type="string" default="1"  />
		<shortdesc lang="en">Count of attempts to retry power on</shortdesc>
	</parameter>

which correspond to the options mentioned in the description.
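
To check which defaults a given node actually ships with, the values can be pulled out of the metadata with a small filter. This is a minimal sketch: param_default is a hypothetical helper, not part of fence-agents, and it assumes the metadata is laid out as in the snippets above, with the <content ... default="N" /> line inside the matching <parameter> element.

```shell
#!/bin/sh
# Print the default value of one parameter from fence-agent metadata XML
# read on stdin. Hypothetical helper; relies on the line-oriented layout
# shown in the snippets above rather than on real XML parsing.
param_default() {
    sed -n "/<parameter name=\"$1\"/,/<\/parameter>/s/.*<content[^>]* default=\"\([^\"]*\)\".*/\1/p"
}

# Usage on a node with fence-agents installed:
#   fence_ipmilan -o metadata | param_default power_timeout
#   fence_ipmilan -o metadata | param_default retry_on
```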

I'd suggest adding the following parameters to the command for creating the fencing device in pcs:

  power_timeout=60 retry_on=3
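
Applied to the command from comment 6, that could look roughly like the sketch below. The function name and its arguments are hypothetical, and login=, passwd=, lanplus=1 and the 60s monitor interval are assumptions standing in for the ${username_chunk}, ${password_chunk}, ${lanplus_chunk} and ${interval} placeholders. The address and node name are taken from the log in the description. The command is only printed, so it can be reviewed before being run.

```shell
#!/bin/sh
# Print a 'pcs stonith create' command for fence_ipmilan with the two
# suggested parameters appended. Hypothetical helper; nothing is executed.
stonith_create_cmd() {
    bmc_addr=$1      # IPMI/BMC address of the node to fence
    host_list=$2     # cluster node name(s) this device may fence
    login=$3         # assumed to map to ${username_chunk}
    passwd=$4        # assumed to map to ${password_chunk}
    echo /usr/sbin/pcs stonith create "stonith-ipmilan-${bmc_addr}" fence_ipmilan \
        "pcmk_host_list=${host_list}" "ipaddr=${bmc_addr}" \
        "login=${login}" "passwd=${passwd}" lanplus=1 \
        power_timeout=60 retry_on=3 \
        op monitor interval=60s
}

stonith_create_cmd 10.35.160.170 pcmk-mac848f69fbc4c3 admin secret
```

On a cluster where the device already exists, `pcs stonith update stonith-ipmilan-<address> power_timeout=60 retry_on=3` should set the same two parameters without recreating the device.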

Comment 14 Leonid Natapov 2015-07-21 15:01:13 UTC

Reproduces on the latest rhel-osp-director puddle.

Comment 15 Ofer Blaut 2015-11-25 09:34:04 UTC

Moving the bug to ospd, since the problem is still there.

Comment 16 Michele Baldessari 2015-12-01 09:19:06 UTC

Hi Leonid,

Can you clarify what you mean by "reproduced in director"? (I ask because
osp-d does not configure fencing out of the box.)

Can you share the following:
- CIB (pcs cluster cib) of this OSP-d cluster?
- I.e. how did you configure fencing and how are you trying to fence the node?
- Can we get /var/log/pacemaker.log from all three controllers

thanks,
Michele

Comment 17 Leonid Natapov 2015-12-01 13:45:37 UTC

I configure it manually according to this guide:
https://docs.google.com/document/d/10FPwRba6aJ4PzXLw7FKR77mV5PwKpSOiEUwGAyc1o18/edit

Comment 18 Michele Baldessari 2015-12-01 19:54:37 UTC

So I tried to reproduce this to no avail (i.e. fencing does a correct reboot).

This might simply be due to your bare metal needing a bit of tuning because IPMI is slow to respond. Have you tried the suggestion Andrew gave in comment 12?

If you still have this environment around, can you ping me online (bandini)
and I'll take a look?

Comment 19 Leonid Natapov 2015-12-13 06:29:41 UTC

(In reply to Michele Baldessari from comment #18)
> So I tried to reproduce this to no avail (i.e. fencing does a correct reboot).
> 
> This might simply be due to your bare metal needing a bit of tuning because
> IPMI is slow to respond. Have you tried the suggestion Andrew gave in comment
> 12?
> 
> If you still have this environment around, can you ping me online (bandini)
> and I'll take a look?

Sure, will ping you. You can also find me on IRC (Lesik).

Comment 23 David Hill 2016-04-28 14:33:16 UTC

Would updating ipmitool to the same version as BZ 1269523 help fix this issue?

Comment 24 Leonid Natapov 2016-05-07 15:40:32 UTC

(In reply to David Hill from comment #23)
> Would updating ipmitool to the same version as BZ 1269523 help fix this
> issue?

I have tested with ipmitool-1.8.15-7.el7.x86_64 and the problem still exists.

Comment 25 Michele Baldessari 2016-06-24 07:48:03 UTC

Has the suggestion in comment 12 been tried? What was the outcome?

Comment 27 Andrew Beekhof 2016-08-24 01:13:03 UTC

Can we get some feedback on the suggestion in comment #12 please?

Comment 28 Leonid Natapov 2016-08-25 10:05:50 UTC

(In reply to Andrew Beekhof from comment #27)
> Can we get some feedback on the suggestion in comment #12 please?

Hey Andrew. Sorry for the late response.

It has been tried and it solved the problem.

Comment 29 Fabio Massimo Di Nitto 2016-08-30 12:09:24 UTC

Moving to OSP11; this depends on the auto-fencing configuration.

This bug is mostly input for the auto-fencing configuration. Chris, we need to make sure that we use sane timeout defaults when configuring devices, or allow overrides as we go.

Comment 31 Fabio Massimo Di Nitto 2017-01-11 04:47:58 UTC


*** This bug has been marked as a duplicate of bug 1242422 ***