Bug 1388098

Summary: [RFE] Prevent RHV-M from restarting hosts during large outage
Product: Red Hat Enterprise Virtualization Manager Reporter: Julio Entrena Perez <jentrena>
Component: ovirt-engine Assignee: Ori Liel <oliel>
Status: CLOSED ERRATA QA Contact: Petr Matyáš <pmatyas>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.6.9 CC: dmoessne, jentrena, lsurette, lsvaty, mgoldboi, mkalinin, mperina, mtessun, oliel, pstehlik, rdlugyhe, Rhev-m-bugs, srevivo
Target Milestone: ovirt-4.3.0 Keywords: FutureFeature
Target Release: --- Flags: rdlugyhe: needinfo? (oliel)
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.3.0_rc Doc Type: Enhancement
Doc Text:
The current release provides a software hook for the Manager to disable restarting hosts following an outage. For example, this capability would help prevent thermal damage to hardware following an HVAC failure.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-05-08 12:36:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1523346    

Comment 1 Julio Entrena Perez 2016-10-24 13:03:17 UTC
1. Proposed title of this feature request  
Provide a hook mechanism for fencing

3. What is the nature and description of the request?  
Provide a hook mechanism for the fencing flow so customers can add hooks to influence (prevent) a host from being fenced.
  
4. Why does the customer need this? (List the business requirements here)  
Customer had an air conditioning outage in one of their datacenters. This resulted in servers powering down in reaction to overheating events.
RHEV-M kept powering the servers back on, which is undesirable in such a scenario due to:
- risks of hardware being damaged.
- instability in the RHEV clusters due to servers continuously coming online and going offline.

Customer would like a mechanism to check whether it is safe to power on a host before doing so.
Due to the vast range of out-of-band management devices and possible checks, customer accepts that the requested capability is delivered via a hook mechanism where they can plug in their custom script, thus keeping RHEV agnostic and flexible in this regard.
  
5. How would the customer like to achieve this? (List the functional requirements here)  
RHEV provides a hook mechanism that allows the customer to specify an optional script that must execute successfully before a host is powered on.
(Customer will use this script to query the ambient temperature via IPMI and return success if the temperature is within range).
  
6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.  
RHEV allows specifying an optional script that must be successfully executed before powering on a server (before "action = Start" fence command).
RHEV passes the details of the fencing device (address, username, password, etc) to the hook script so the fencing device details are available to the hook script.
RHEV takes the exit code of this script into account and proceeds to power on a server only if the exit code of the custom hook script allows it.
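
As an illustration of the kind of script the customer describes, here is a minimal sketch in Python. Everything in it is an assumption made for the example, not part of RHEV or of this request: the temperature limits, the sensor name "Ambient Temp", the pipe-delimited sensor text format, and the convention that exit code 0 means "safe to power on". How the sensor text is obtained (for example from an IPMI tool using the fencing device details passed to the hook) is left out.

```python
"""Hypothetical pre-power-on check: allow power-on only if ambient temperature is in range."""

# Assumed limits in degrees Celsius; not taken from the bug report.
MIN_TEMP_C = 10.0
MAX_TEMP_C = 35.0


def parse_ambient_temp(sensor_output, sensor_name="Ambient Temp"):
    """Return the reading for sensor_name from pipe-delimited text, or None.

    Assumes lines shaped like: 'Ambient Temp | 0Eh | ok | 7.1 | 22 degrees C'.
    """
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or fields[0] != sensor_name:
            continue
        reading = fields[-1].split()
        try:
            return float(reading[0])
        except (IndexError, ValueError):
            return None  # e.g. the sensor reports no numeric reading
    return None


def exit_code_for(temp_c):
    """Map a reading to an exit code: 0 allows power-on, 1 blocks it."""
    if temp_c is None:
        return 1  # fail closed when the sensor cannot be read
    return 0 if MIN_TEMP_C <= temp_c <= MAX_TEMP_C else 1


def main(stream):
    """Read sensor text from a file-like object and return the exit code."""
    return exit_code_for(parse_ambient_temp(stream.read()))
```

A wrapper could call `sys.exit(main(sys.stdin))` after piping the sensor query output into the script. Failing closed on an unreadable sensor is a design choice for this sketch: during an outage, a host that cannot be checked stays powered off.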
  
7. Is there already an existing RFE upstream or in Red Hat Bugzilla?  
None found.
We'd expect this request to be achievable after the fencing refactoring done in bug 1158861.
  
8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?  
Next RHEV major release.
  
9. Is the sales team involved in this request and do they have any additional input?  
No.
  
10. List any affected packages or components.  
Unknown.
  
11. Would the customer be able to assist in testing this functionality if implemented?  
Yes.

Comment 20 Yaniv Kaul 2018-01-30 08:34:16 UTC
Martin, I once again propose we look at the External Status field - this is exactly what it's there for. If it's anything other than OK, the Engine should simply not change the status of the host - not fence it, not move it to Maintenance, not Up, nothing.

Thoughts?
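
The rule proposed in this comment can be sketched as a small gate in Python. This is an illustration only and not Engine code: the function names, the string representation of the status, and the action names are hypothetical, and treating a missing status as "not OK" is an assumption of the sketch.

```python
def may_change_host_status(external_status):
    """Return True only when the externally reported status is OK.

    external_status is assumed to be a string such as "OK" or "Warning",
    or None when nothing has been reported.
    """
    if external_status is None:
        return False  # assumption: an unknown status blocks changes
    return external_status.strip().lower() == "ok"


def filter_actions(external_status, planned_actions):
    """Drop every planned status-changing action unless the status is OK."""
    if may_change_host_status(external_status):
        return list(planned_actions)
    return []
```

For example, `filter_actions("Warning", ["fence", "maintenance"])` yields an empty list, so under this sketch the host would be neither fenced nor moved to Maintenance.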

Comment 23 Petr Matyáš 2019-01-23 12:15:36 UTC
Verified on ovirt-engine-4.3.0-0.8.rc2.el7.noarch

Comment 26 errata-xmlrpc 2019-05-08 12:36:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085