1090511 – OVIRT35 - [RFE] Improve fencing robustness by retrying failed attempts

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	oVirt
Classification:	Retired
Component:	ovirt-engine-core
Sub Component:
Version:	3.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Eli Mesika
QA Contact:	sefi litmanovich
Docs Contact:
URL:
Whiteboard:	infra
Depends On:	1129381
Blocks:	961753

Reported:	2014-04-23 13:44 UTC by Oved Ourfali
Modified:	2016-02-10 19:33 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-10-17 12:41:32 UTC
oVirt Team:	Infra
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	27309	0	None	None	None	Never

Description Oved Ourfali 2014-04-23 13:44:52 UTC

Currently, when a hypervisor becomes unresponsive and has to be fenced, only one other host within the cluster (the "fence proxy") is responsible for fencing it. Moreover, if the fencing action fails for some reason, it is not re-attempted, which leaves the victim host unresponsive and requires manual intervention.

The request here is to improve the robustness of fencing. If a fencing attempt fails (e.g. because of a temporary communication problem between the chosen proxy host and the victim's PM), then re-attempt the fencing action and/or attempt it from a different host.

The "fence proxy" might have connectivity problems reaching the victim's power management system, while other hosts may well be able to access it and succeed at fencing. Also, some of these failures are transient.

Failing at the first attempt and not retrying requires manual operator intervention. While we wait for this to happen, we could keep trying from other hosts.
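
To illustrate the requested behaviour, here is a minimal Python sketch of retrying a fencing action and falling back to other proxy hosts. It is not oVirt engine code: `fence_with_retries`, the `(name, action)` proxy pairs, and the `attempts_per_proxy` parameter are all hypothetical names chosen for this example, and a real implementation would also need delays, logging, and a status check after the power action.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical shape of a fence proxy: a callable that takes the victim's
# name and returns True when the fencing action succeeded.
FenceAction = Callable[[str], bool]


def fence_with_retries(
    victim: str,
    proxies: Iterable[Tuple[str, FenceAction]],
    attempts_per_proxy: int = 2,
) -> Optional[str]:
    """Try each proxy in order, retrying each one a few times.

    Returns the name of the proxy that fenced the victim, or None if
    every attempt from every proxy failed.
    """
    for proxy_name, action in proxies:
        for _ in range(attempts_per_proxy):
            try:
                if action(victim):
                    return proxy_name
            except OSError:
                # Assumption: communication problems surface as OSError
                # (ConnectionError is a subclass); count it as a failed attempt.
                continue
    return None
```

With this shape, a transient failure is covered by the inner loop and a proxy that cannot reach the victim's PM at all is covered by the outer loop; manual intervention is needed only when the function returns `None`.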

Comment 1 Sven Kieske 2014-04-23 14:00:28 UTC

Additional idea:
Why not make the behaviour configurable for the case where fencing fails
the first time?
Or include some kind of default fencing policy
in a configuration file, which users can edit to suit their needs?
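
As a rough illustration of this idea, a fencing policy could be read from an INI-style file with documented defaults. The sketch below is purely hypothetical: the `[fencing]` section, the option names `retries`, `retry_delay_seconds` and `try_other_proxies`, and the `FencePolicy` class are assumptions made for this example and do not correspond to any existing oVirt setting.

```python
import configparser
from dataclasses import dataclass


@dataclass(frozen=True)
class FencePolicy:
    """Hypothetical policy applied when the first fencing attempt fails."""
    retries: int = 1
    retry_delay_seconds: float = 0.0
    try_other_proxies: bool = True


def load_policy(text: str) -> FencePolicy:
    """Parse an INI-style policy; missing options fall back to the defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    defaults = FencePolicy()
    if not parser.has_section("fencing"):
        return defaults
    section = parser["fencing"]
    return FencePolicy(
        retries=section.getint("retries", fallback=defaults.retries),
        retry_delay_seconds=section.getfloat(
            "retry_delay_seconds", fallback=defaults.retry_delay_seconds
        ),
        try_other_proxies=section.getboolean(
            "try_other_proxies", fallback=defaults.try_other_proxies
        ),
    )
```

A user who wants three retries from the original proxy only could then write a file containing `[fencing]`, `retries = 3` and `try_other_proxies = no`, leaving every other option at its default.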

Comment 2 Sandro Bonazzola 2014-10-17 12:41:32 UTC

oVirt 3.5 has been released and should include the fix for this issue.
