Bug 1388098 - [RFE] Prevent RHV-M from restarting hosts during large outage [NEEDINFO]
Summary: [RFE] Prevent RHV-M from restarting hosts during large outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.9
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Ori Liel
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks: CEECIR_RHV43_proposed
Reported: 2016-10-24 13:02 UTC by Julio Entrena Perez
Modified: 2019-05-08 12:37 UTC (History)

Fixed In Version: ovirt-engine-4.3.0_rc
Doc Type: Enhancement
Doc Text:
The current release provides a software hook for the Manager to disable restarting hosts following an outage. For example, this capability would help prevent thermal damage to hardware following an HVAC failure.
Clone Of:
Environment:
Last Closed: 2019-05-08 12:36:48 UTC
oVirt Team: Infra
Target Upstream Version:
rdlugyhe: needinfo? (oliel)


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 2763751 | 0 | None | None | None | 2016-11-13 19:59:37 UTC
Red Hat Product Errata | RHEA-2019:1085 | 0 | None | None | None | 2019-05-08 12:37:11 UTC
oVirt gerrit | 95373 | 0 | master | MERGED | engine: Prevent automatic PM on hosts with non-ok external status | 2020-03-18 20:48:48 UTC
oVirt gerrit | 95510 | 0 | master | MERGED | engine: Prevent automatic PM on hosts with non-ok external status - cont'd | 2020-03-18 20:48:48 UTC

Comment 1 Julio Entrena Perez 2016-10-24 13:03:17 UTC
1. Proposed title of this feature request  
Provide a hook mechanism for fencing

3. What is the nature and description of the request?  
Provide a hook mechanism for the fencing flow so customers can add hooks to influence (prevent) a host from being fenced.
  
4. Why does the customer need this? (List the business requirements here)  
Customer had an air conditioning outage in one of their datacenters. This resulted in servers powering down in reaction to overheating events.
RHEV-M kept powering the servers back on, which is undesirable in such a scenario due to:
- risks of hardware being damaged.
- instability in the RHEV clusters due to servers continuously coming online and going offline.

Customer would like a mechanism to check whether it is safe to power on a host before doing so.
Due to the vast range of out-of-band management devices and possible checks, the customer accepts that the requested capability be delivered via a hook mechanism into which they can plug a custom script, thus keeping RHEV agnostic and flexible in this regard.
  
5. How would the customer like to achieve this? (List the functional requirements here)  
RHEV provides a hook mechanism that allows the customer to specify an optional script that must execute successfully before a host is powered on.
(Customer will use this script to query the ambient temperature via IPMI and return success if the temperature is within range).
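
A minimal sketch of what such a customer-side hook script might look like, in Python. Nothing here is part of RHEV: the argument order (address, username, password), the temperature window, and the sensor name "Ambient" are assumptions for illustration, and the parser assumes sensor lines shaped like the sample in its docstring. The script queries the BMC with ipmitool and returns 0 only when an ambient reading is found and lies within range.

```python
#!/usr/bin/env python3
"""Hypothetical pre-power-on hook: return 0 only if ambient temperature is in range."""
import re
import subprocess
import sys

# Assumed safe operating window in degrees Celsius (site-specific).
MIN_TEMP_C = 5.0
MAX_TEMP_C = 35.0


def parse_ambient_temp(sdr_output):
    """Return the first ambient reading in degrees C, or None if none is found.

    Assumes lines shaped like: 'Ambient Temp | 0Eh | ok | 7.1 | 22 degrees C'.
    """
    for line in sdr_output.splitlines():
        match = re.search(r"(-?\d+(?:\.\d+)?)\s+degrees C", line)
        if match and "ambient" in line.lower():
            return float(match.group(1))
    return None


def is_safe_to_power_on(temp_c, low=MIN_TEMP_C, high=MAX_TEMP_C):
    """Treat an unknown reading (None) as unsafe."""
    return temp_c is not None and low <= temp_c <= high


def main(argv):
    # Hypothetical calling convention: BMC address, username, password.
    address, username, password = argv[1:4]
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", address, "-U", username,
         "-P", password, "sdr", "type", "Temperature"],
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode != 0:
        return 1  # the BMC could not be queried: do not allow power-on
    return 0 if is_safe_to_power_on(parse_ambient_temp(result.stdout)) else 1

# As a standalone script, the file would end with: sys.exit(main(sys.argv))
```

Failing closed (a non-zero exit when the BMC cannot be reached or no reading is found) is a design choice in this sketch; a site could equally decide that an unreachable BMC should not block power-on.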
  
6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.  
RHEV allows specifying an optional script that must be successfully executed before powering on a server (before "action = Start" fence command).
RHEV passes the details of the fencing device (address, username, password, etc.) to the hook script so that they are available to it.
RHEV takes the exit code of that script into account and only proceeds to power on a server if the exit code of the custom hook script allows it.
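
The requested behaviour on the manager side can be sketched as follows, in Python. This is only an illustration of the requirement, not engine code: the function name and the FENCE_* environment variable names are hypothetical, and passing the fencing device details through the environment is one assumed way of making them available to the hook.

```python
import os
import subprocess


def hook_allows_power_on(hook_cmd, address, username, password, timeout=60):
    """Run the optional hook; return True only if it exits with status 0.

    hook_cmd is an argument list such as ["/path/to/hook"]; None means
    that no hook is configured.
    """
    if hook_cmd is None:
        return True  # the hook is optional: without one, behave as before
    env = dict(
        os.environ,
        FENCE_ADDRESS=address,    # hypothetical variable names
        FENCE_USERNAME=username,
        FENCE_PASSWORD=password,
    )
    try:
        completed = subprocess.run(hook_cmd, env=env, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False  # a hook that cannot run or hangs blocks the power-on
    return completed.returncode == 0
```

A caller would check `hook_allows_power_on(...)` immediately before issuing the "action = Start" fence command and skip the command when it returns False.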
  
7. Is there already an existing RFE upstream or in Red Hat Bugzilla?  
None found.
We'd expect this request to be achievable after the fencing refactoring done in bug 1158861.
  
8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?  
Next RHEV major release.
  
9. Is the sales team involved in this request and do they have any additional input?  
No.
  
10. List any affected packages or components.  
Unknown.
  
11. Would the customer be able to assist in testing this functionality if implemented?  
Yes.

Comment 20 Yaniv Kaul 2018-01-30 08:34:16 UTC
Martin, I once again propose we look at the External Status field - this is exactly what it's there for. If it's anything other than OK, the Engine should simply not change the status of the host - not fence it, not move it to Maintenance, not Up, nothing.

Thoughts?
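
The rule proposed in this comment can be sketched as a single predicate, in Python. This is an illustration only, not the code merged in the linked gerrit changes; the enum below is a stand-in whose value names are assumed to mirror the host external status values, and the function name is hypothetical.

```python
from enum import Enum


class ExternalStatus(Enum):
    # Assumed set of external status values; only OK permits automatic action.
    OK = "ok"
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    FAILURE = "failure"


def automatic_power_management_allowed(external_status):
    """Allow automatic fencing or status changes only when the status is OK."""
    return external_status is ExternalStatus.OK
```

Under this sketch, an external monitoring system that detects an HVAC failure would set the external status of the affected hosts to something other than OK, and the manager would then leave those hosts alone until the status is set back to OK.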

Comment 23 Petr Matyáš 2019-01-23 12:15:36 UTC
Verified on ovirt-engine-4.3.0-0.8.rc2.el7.noarch

Comment 26 errata-xmlrpc 2019-05-08 12:36:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

