Bug 1120829

Summary: [RFE] Do not fence hosts when more than X% of hosts are in a Non-Responding or Connecting state
Product: Red Hat Enterprise Virtualization Manager Reporter: Scott Herold <sherold>
Component: ovirt-engine Assignee: Eli Mesika <emesika>
Status: CLOSED ERRATA QA Contact: sefi litmanovich <slitmano>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.4.0 CC: bazulay, ecohen, howey.vernon, iheim, lbopf, lpeer, oourfali, pablo.iranzo, pstehlik, rbalakri, Rhev-m-bugs, sherold, yeylon
Target Milestone: --- Keywords: FutureFeature
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: ovirt-3.5.0_rc1.1 Doc Type: Enhancement
Doc Text:
A new option in the 'Fencing Policy' tab of the 'New/Edit Cluster' window allows users to disable fencing of hosts in the cluster if more than a user-defined percentage of hosts have connectivity issues. This can prevent hosts being fenced in scenarios where hosts are in a 'Non-Responding' or 'Connecting' state due to a general network connectivity error, rather than a host error.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-02-11 18:06:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1084611, 1142923, 1156165    

Description Scott Herold 2014-07-17 19:37:28 UTC
Infra/Engine component to improve fencing logic

When Triggered
--------------
This action is triggered once RHEV-M has made the decision that a target host may need to be fenced.  Prior to issuing a fencing command, a user may specify an optional time value to wait, during which the engine checks what percentage or number of total hosts have moved to a non-operational state.

In this workflow, a calculation will be done over a short period of time that will determine if one or two optional conditions are met before sending fencing commands to proxy hosts:

1) If more than X% of hosts are non-operational, do not fence any host
2) If more than Y# of hosts are non-operational, do not fence any host

This will help prevent fence storms from leaving the engine, where the requests could be picked up in retries and introduce a race condition on fence requests.
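
The two conditions above could be sketched as follows. This is only an illustration of the proposal as worded in this description, not engine code: the function name and parameters are hypothetical, and "more than" is assumed to mean a strict comparison.

```python
from typing import Optional


def should_skip_fencing(total_hosts: int,
                        failed_hosts: int,
                        max_failed_percent: Optional[float] = None,
                        max_failed_count: Optional[int] = None) -> bool:
    """Return True if no host should be fenced (hypothetical helper).

    A condition set to None is treated as not configured.
    """
    if total_hosts <= 0:
        # Nothing to evaluate, so do not block fencing.
        return False

    # Condition 1: more than X% of hosts have failed.
    if max_failed_percent is not None:
        if failed_hosts * 100 > max_failed_percent * total_hosts:
            return True

    # Condition 2: more than Y hosts have failed.
    if max_failed_count is not None:
        if failed_hosts > max_failed_count:
            return True

    return False
```

With X = 50, six failed hosts out of ten would block fencing, while exactly five out of ten would not, because the threshold has to be exceeded.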

UX
--
There will be an option in the Fencing Policy sub menu (Defined by BZ 1118879) to set the following options:

[Boolean]
"Do not fence if the following conditions are met" - Turns entire check on or off.  One or both % or # values must be specified if this option is enabled.

"Time of Delay" - Amount of time to wait for multiple hosts to enter non-responsive state
DEFAULT: 60 - Seconds

"Greater than X% of hosts fail"
DEFAULT: 50%

"More than Y# of hosts fail"
DEFAULT: 0/Disabled
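
As a rough sketch, the options above and their defaults could be modeled like this. The class and field names are hypothetical, the default for the Boolean is an assumption (none is stated above), and treating a percentage of 0 as "not specified" is also an assumption made so that the "one or both" rule can be checked.

```python
from dataclasses import dataclass


@dataclass
class SkipFencingOptions:
    """Hypothetical container for the proposed Fencing Policy options."""
    enabled: bool = False      # assumed default; not stated in this description
    delay_seconds: int = 60    # "Time of Delay", DEFAULT: 60 seconds
    failed_percent: int = 50   # "Greater than X% of hosts fail", DEFAULT: 50%
    failed_count: int = 0      # "More than Y# of hosts fail", 0 means disabled

    def validate(self) -> None:
        """Raise ValueError if the options are inconsistent."""
        if not self.enabled:
            # The whole check is off, so the other values are ignored.
            return
        if not 0 <= self.failed_percent <= 100:
            raise ValueError("failed_percent must be between 0 and 100")
        if self.failed_count < 0:
            raise ValueError("failed_count must not be negative")
        if self.delay_seconds < 0:
            raise ValueError("delay_seconds must not be negative")
        # One or both of the % and # values must be specified when enabled.
        if self.failed_percent == 0 and self.failed_count == 0:
            raise ValueError("specify a percentage, a host count, or both")
```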

Comment 1 Barak 2014-07-28 11:09:06 UTC
(In reply to Scott Herold from comment #0)
> Infra/Engine component to improve fencing logic
> 
> When Triggered
> --------------
> This action is triggered once RHEV-M had made the decision that a target
> host may need to be fenced.  Prior to issuing a fencing command, a user may
> specify an optional time value to wait to investigate a percentage or number
> of total hosts that have moved to a non-operational state.

The additional timeout is redundant; there are too many timeout parameters in the fencing flow as it is.
This decision should be made after the SSH vdsm restart that happens near the beginning of the fencing flow.

> 
> In this workflow, a calculation will be done over a short period of time
> that will determine if one or two optional conditions are met before sending
> fencing commands to proxy hosts:
> 
> 1) If more than X% of hosts are non-operational, do not fence any host
> 2) If more than Y# of hosts are non-operational, do not fence any host

I do not understand #2. Why define it? What use case will it cover?

> 
> This will help prevent fence storms from leaving the engine to be
> potentially picked up in retries and introducing a potential race condition
> on fence requests.
> 
> UX
> --
> There will be an option in the Fencing Policy sub menu (Defined by BZ
> 1118879) to set the following options:
> 
> [Boolean]
> "Do not fence if the following conditions are met" - Turns entire check on
> or off.  One or both % or # values must be specified if this option is
> enabled.

The above is very confusing... one or both?
I think the # is redundant.

> 
> "Time of Delay" - Amount of time to wait for multiple hosts to enter
> non-responsive state
> DEFAULT: 60 - Seconds

The above is redundant.

> 
> "Greater than X% of hosts fail"
> DEFAULT: 50%
> 
> "More than Y# of hosts fail"
> DEFAULT: 0/Disabled

Comment 2 Scott Herold 2014-07-30 15:28:16 UTC
OK to limit the scope to the % of hosts that have failed and to remove the number of hosts from the rule.  Also OK to remove the timeout.  

Barak/Oved, with the simpler scope, realistically, is this a target for 3.5 or 3.6?

Comment 3 Oved Ourfali 2014-07-31 07:40:05 UTC
(In reply to Scott Herold from comment #2)
> OK to limit scope to % of hosts that have failed and removing number of
> hosts from the rule.  Also OK to remove the timeout.  
> 
> Barak/Oved, with the simpler scope, realistically, is this a target for 3.5
> or 3.6?

Targeting for 3.5.0.
Still waiting for an answer on what the deadline is for exception RFEs to be MODIFIED. The answer might change the target release, but we're targeted for 3.5.0 for now.

Comment 4 Eli Mesika 2014-08-05 12:01:14 UTC
F

Comment 5 Eli Mesika 2014-08-05 12:06:08 UTC
I have changed the BZ title to refer to the Non-Responding state, which is network-based, instead of the Non-Operational state, which is storage-based.

Also, I have agreed with Scott that for the percentage calculations we will seraph for all hosts in the Non-Responding or Connecting state, since upon a network issue a host first goes to the Connecting state for a while before it is marked as Non-Responding.

Comment 6 Eli Mesika 2014-08-05 12:08:16 UTC
(In reply to Eli Mesika from comment #5)
> will seraph
seraph=>search

Comment 7 Scott Herold 2014-08-05 12:10:00 UTC
Eli - Agree on the "Connecting" state.  It will be the first indication of a potential networking problem if half of the hosts go into a non-responsive state.  While we don't take action on "Connecting", there may be overlap where we see a mix of non-responsive and connecting, and we want to include these in the total counts.
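
A minimal sketch of the reduced scope agreed in the comments above (percentage only, counting both Connecting and Non-Responding hosts) might look like this. The enum, the function name, and the strict "more than" comparison are assumptions for illustration; the status list is assumed to cover the hosts of one cluster, and this is not the actual ovirt-engine implementation.

```python
from enum import Enum
from typing import Iterable


class HostStatus(Enum):
    """Hypothetical subset of host states relevant to this check."""
    UP = "Up"
    CONNECTING = "Connecting"
    NON_RESPONDING = "Non-Responding"
    NON_OPERATIONAL = "Non-Operational"


# States counted as a connectivity issue, per comments 5 and 7.
CONNECTIVITY_ISSUE_STATES = {HostStatus.CONNECTING, HostStatus.NON_RESPONDING}


def skip_fencing(statuses: Iterable[HostStatus], threshold_percent: int) -> bool:
    """Return True if more than threshold_percent of hosts have connectivity issues."""
    hosts = list(statuses)
    if not hosts:
        return False
    affected = sum(1 for status in hosts if status in CONNECTIVITY_ISSUE_STATES)
    # Integer arithmetic avoids rounding problems at the threshold.
    return affected * 100 > threshold_percent * len(hosts)
```

Non-Operational hosts are deliberately left out of the count here, since that state was described above as storage-based rather than network-based.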

Comment 9 errata-xmlrpc 2015-02-11 18:06:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html