Bug 1120829
Summary: | [RFE] Do not fence hosts when more than X% of hosts are in a Non-Responding or Connecting state | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Scott Herold <sherold> |
Component: | ovirt-engine | Assignee: | Eli Mesika <emesika> |
Status: | CLOSED ERRATA | QA Contact: | sefi litmanovich <slitmano> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.4.0 | CC: | bazulay, ecohen, howey.vernon, iheim, lbopf, lpeer, oourfali, pablo.iranzo, pstehlik, rbalakri, Rhev-m-bugs, sherold, yeylon |
Target Milestone: | --- | Keywords: | FutureFeature |
Target Release: | 3.5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | infra | ||
Fixed In Version: | ovirt-3.5.0_rc1.1 | Doc Type: | Enhancement |
Doc Text: | A new option in the 'Fencing Policy' tab of the 'New/Edit Cluster' window allows users to disable fencing of hosts in the cluster if more than a user-defined percentage of hosts have connectivity issues. This can prevent hosts being fenced in scenarios where hosts are in a 'Non-Responding' or 'Connecting' state due to a general network connectivity error, rather than a host error. |
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2015-02-11 18:06:13 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1084611, 1142923, 1156165 |
Description
Scott Herold
2014-07-17 19:37:28 UTC
(In reply to Scott Herold from comment #0)
> Infra/Engine component to improve fencing logic
>
> When Triggered
> --------------
> This action is triggered once RHEV-M had made the decision that a target
> host may need to be fenced. Prior to issuing a fencing command, a user may
> specify an optional time value to wait to investigate a percentage or number
> of total hosts that have moved to a non-operational state.

The additional timeout is redundant; there are already too many timeout parameters in the fencing flow. This decision should be made after the SSH vdsm restart that happens near the beginning of the fencing flow.

> In this workflow, a calculation will be done over a short period of time
> that will determine if one or two optional conditions are met before sending
> fencing commands to proxy hosts:
>
> 1) If more than X% of hosts are non-operational, do not fence any host
> 2) If more than Y# of hosts are non-operational, do not fence any host

I do not understand #2. Why define it? What use case will it cover?

> This will help prevent fence storms from leaving the engine to be
> potentially picked up in retries and introducing a potential race condition
> on fence requests.
>
> UX
> --
> There will be an option in the Fencing Policy sub menu (Defined by BZ
> 1118879) to set the following options:
>
> [Boolean]
> "Do not fence if the following conditions are met" - Turns entire check on
> or off. One or both % or # values must be specified if this option is
> enabled.

The above is very confusing ... one or both? I think the # is redundant.

> "Time of Delay" - Amount of time to wait for multiple hosts to enter
> non-responsive state
> DEFAULT: 60 - Seconds

The above is redundant.

> "Greater than X% of hosts fail"
> DEFAULT: 50%
>
> "More than Y# of hosts fail"
> DEFAULT: 0/Disabled

OK to limit scope to % of hosts that have failed and removing number of hosts from the rule. Also OK to remove the timeout.

Barak/Oved, with the simpler scope, realistically, is this a target for 3.5 or 3.6?

(In reply to Scott Herold from comment #2)
> OK to limit scope to % of hosts that have failed and removing number of
> hosts from the rule. Also OK to remove the timeout.
>
> Barak/Oved, with the simpler scope, realistically, is this a target for 3.5
> or 3.6?

Targeting for 3.5.0. Still waiting for an answer on the deadline for exception RFEs to be MODIFIED. The answer might change the target release, but we're targeted for 3.5.0 for now.

I have changed the BZ title to refer to the Non-Responding state, which is network-based, instead of the Non-Operational state, which is storage-based. Also, I concluded with Scott that for the percentage calculation we will seraph for all hosts in the Non-Responding or Connecting state, since upon a network issue a host first goes to the Connecting state for a while before it is marked as Non-Responding.

(In reply to Eli Mesika from comment #5)
> will seraph

seraph=>search

Eli - Agree on the "Connecting" state. It will be the first indication of a potential networking problem if half of the hosts go into a non-responsive state. While we don't take action on "Connecting", there may be overlap where we see a mix of non-responsive and connecting, and we want to include these in the total counts.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html
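
For readers following the thread, the rule as finally scoped (percentage only, counting hosts in both Connecting and Non-Responding states) can be sketched roughly as below. This is an illustrative Python sketch, not the ovirt-engine implementation: the names `HostStatus`, `CONNECTIVITY_ISSUE_STATUSES`, and `should_skip_fencing` are hypothetical, and the 50% default is simply the value proposed in comment #0.

```python
from enum import Enum


class HostStatus(Enum):
    # Hypothetical status enum for illustration; not the engine's own type.
    UP = "Up"
    CONNECTING = "Connecting"
    NON_RESPONDING = "Non-Responding"
    NON_OPERATIONAL = "Non-Operational"


# Statuses counted as connectivity problems, per the discussion in this bug.
CONNECTIVITY_ISSUE_STATUSES = {HostStatus.CONNECTING, HostStatus.NON_RESPONDING}


def should_skip_fencing(host_statuses, threshold_percent=50):
    """Return True when MORE than threshold_percent of the cluster's hosts
    are in a Connecting or Non-Responding state (assumed semantics)."""
    total = len(host_statuses)
    if total == 0:
        return False  # nothing to evaluate, so do not block fencing
    affected = sum(1 for s in host_statuses if s in CONNECTIVITY_ISSUE_STATUSES)
    # Integer form of affected / total > threshold_percent / 100,
    # which avoids floating-point rounding at the boundary.
    return affected * 100 > threshold_percent * total


cluster = [
    HostStatus.CONNECTING,
    HostStatus.NON_RESPONDING,
    HostStatus.NON_RESPONDING,
    HostStatus.UP,
]
print(should_skip_fencing(cluster))  # 3 of 4 hosts affected, above 50%
```

Note that the comparison is strict: with exactly half of the hosts affected and a 50% threshold, this sketch would still allow fencing, matching the "more than X%" wording of the summary.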