961753 – PRD35 - [RFE] Improve fencing robustness by retrying failed attempts

Summary: PRD35 - [RFE] Improve fencing robustness by retrying failed attempts

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Eli Mesika
QA Contact:	sefi litmanovich
Docs Contact:
URL:
Whiteboard:	infra
Duplicates (1):	1061722
Depends On:	1090511
Blocks:	rhev3.5beta 1156165

Reported:	2013-05-10 10:59 UTC by Josep 'Pep' Turro Mauri
Modified:	2019-03-22 07:08 UTC
CC List:	13 users
Fixed In Version:	ovirt-3.5.0-alpha1
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-02-11 17:53:02 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2015:0158	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Virtualization Manager 3.5.0	2015-02-11 22:38:50 UTC
oVirt gerrit	27309	0	None	MERGED	[RFE] Improve fencing robustness by retrying...	2020-09-17 09:12:06 UTC

Description Josep 'Pep' Turro Mauri 2013-05-10 10:59:48 UTC

Currently, when a RHEV host becomes unresponsive and has to be fenced, only one other host (the "fence proxy") within the cluster is responsible for fencing it. Moreover, if the fencing action fails for some reason, it is not re-attempted, leaving the victim host unresponsive and requiring manual intervention.

The request here is to improve the robustness of fencing. If a fencing attempt fails (e.g. temporary communication problem between the chosen proxy host and the victim's PM), then re-attempt the fencing action, and/or attempt it from a different host. 

The "fence proxy" might have some connectivity problems to the victim's power management system, but it could well be that other hosts can access it and succeed at fencing. Also, some of these failures are transient.

Failing at first attempt and not re-trying requires manual operator intervention. While we wait for this to happen, we could keep trying from other hosts.
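
A minimal sketch of the requested behaviour, assuming a hypothetical `fence(proxy, victim)` callable that returns True on success. The function and parameter names are illustrative only and are not taken from ovirt-engine:

```python
from typing import Callable, Iterable, Optional

# Hypothetical signature: fence(proxy, victim) -> True if the victim was fenced.
FenceFn = Callable[[str, str], bool]


def fence_via_any_proxy(
    victim: str, proxies: Iterable[str], fence: FenceFn
) -> Optional[str]:
    """Try each candidate proxy in order; return the one that succeeded, or None."""
    for proxy in proxies:
        if proxy == victim:
            continue  # the unresponsive host cannot act as its own proxy
        try:
            if fence(proxy, victim):
                return proxy
        except OSError:
            # Assumption: a communication error counts as a failed attempt,
            # so the next proxy is tried instead of giving up.
            continue
    return None
```

If every proxy fails, the function returns None, which is the point at which manual intervention would still be needed.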

Comment 2 Barak 2013-05-12 12:49:02 UTC

Is this RFE just for the cases of:
- badly configured PMs
- or unreachable PMs

If this is the case, then it should be rather easy to do.

When the fencing actually fails somewhere along the way (e.g. after stop succeeded but start did not bring the host back to up ... after a few minutes), that is a different issue and will be harder to detect and implement.

Comment 3 Josep 'Pep' Turro Mauri 2013-05-13 16:00:32 UTC

The "suggested implementation" (just an idea) of retrying - potentially from a different proxy host every time, is for the simple case of misconfigured/unreachable PMs: the idea is that maybe the selected proxy host has some connectivity problems with the victim's PM, but another host might work just fine; or it could just be a transient problem (seen at the customer's environment).

However, the aim of this RFE is as broad as possible: make fencing as robust as possible, so ideally we would like to cover other scenarios too, like the one you describe of "partial success".

Maybe split into 2 RFEs? One "easy win" with the simple scenario (and hopefully in a release soon in your download channels ;) and another for the more complicated case(s).

Comment 4 Barak 2013-05-16 17:02:48 UTC

So to be clear,

A badly configured PM will never succeed from any proxy.
So we need to check whether we can differentiate a badly configured PM from an unreachable one, and try to handle only the unreachable case through another proxy.
If we can't differentiate, then the retries will take place for both of the above scenarios.

It looks like a general configuration is in order (number of retries ?)
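
The distinction above could be sketched as follows. The failure categories are hypothetical; whether the fence agents actually report enough detail to tell these cases apart is exactly the open question in this comment:

```python
from enum import Enum


class FenceFailure(Enum):
    # Hypothetical categories, not an ovirt-engine or fence-agent type.
    UNREACHABLE = "unreachable"      # proxy could not reach the PM device
    MISCONFIGURED = "misconfigured"  # PM rejected the request (e.g. bad credentials)
    UNKNOWN = "unknown"              # the two cases could not be told apart


def should_retry_from_other_proxy(failure: FenceFailure) -> bool:
    """Decide whether another proxy is worth trying after a failed attempt."""
    # A misconfigured PM fails from every proxy, so a retry cannot help.
    # UNKNOWN is retried as well: if the cases cannot be differentiated,
    # retries take place for both scenarios.
    return failure is not FenceFailure.MISCONFIGURED
```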

Comment 5 Simon Grinberg 2013-05-19 11:56:05 UTC

(In reply to comment #4)
> So to be clear,
> 
> badly configured PM will never succeed from any proxy.
> So we need to check if we can differentiate badly configured from
> unreachable 

The user should test the configuration; we have the Test button. This use case is already solved.

> and try to handle only the unreachable through other proxy.
> If we can't differentiate than the retries will take place for both above
> scenarios.

See the answer above: if the customer took care of the first part and tested the configuration, then retries are a good approach.

> 
> It looks like a general configuration is in order (number of retries ?)

This could be part of the proxy setting that we added in 3.2: while you select the preferred proxy from the list, another list would allow you to select the number of retries.

Comment 6 Andrew Cathrow 2013-06-05 07:11:57 UTC

Simplifying to:

Add option in config tool to retry fencing
- Number of retry attempts
- timeout between retries
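
The two options above could be modelled roughly like this. It is a sketch only: the names `FenceRetryPolicy`, `retries` and `delay_seconds` and their default values are assumptions, not actual config tool keys:

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FenceRetryPolicy:
    retries: int = 3            # hypothetical default: extra attempts after the first
    delay_seconds: float = 5.0  # hypothetical default: wait between attempts


def fence_with_retries(
    attempt: Callable[[], bool],
    policy: FenceRetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run attempt() up to 1 + policy.retries times, pausing between failures."""
    total_attempts = 1 + max(policy.retries, 0)
    for i in range(total_attempts):
        if attempt():
            return True
        if i < total_attempts - 1:
            sleep(policy.delay_seconds)  # no pause after the final failure
    return False
```

The `sleep` parameter is injectable only so that the delay can be replaced in tests.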

Comment 7 Eli Mesika 2014-03-11 13:41:28 UTC

The current proxy selection mechanism already has a retry setting in the configuration.
To resolve this, each host that is a proxy candidate should be tested for being a good proxy by issuing a STATUS fence command to the target host.
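
A rough illustration of that check, assuming a hypothetical `status_check(proxy, victim)` callable that wraps the STATUS fence command and returns True when the proxy can query the target's power management. This is not the merged implementation:

```python
from typing import Callable, Iterable, List

# Hypothetical signature: status_check(proxy, victim) -> True if STATUS succeeded.
StatusFn = Callable[[str, str], bool]


def usable_proxies(
    victim: str, candidates: Iterable[str], status_check: StatusFn
) -> List[str]:
    """Keep only the candidates whose STATUS query against the victim succeeds."""
    usable = []
    for proxy in candidates:
        if proxy == victim:
            continue  # the target host is never its own proxy
        try:
            if status_check(proxy, victim):
                usable.append(proxy)
        except OSError:
            # Assumption: a communication error means this candidate
            # is not a good proxy for this target.
            pass
    return usable
```

The fencing action itself would then be issued only through a host from the returned list.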

Comment 8 Eli Mesika 2014-04-13 11:58:18 UTC

*** Bug 1061722 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2015-02-11 17:53:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html
