Bug 967328 - PRD33 - add soft fencing over SSH (restart VDSM) as a preliminary step before fencing a Non-Responsive host
Summary: PRD33 - add soft fencing over SSH (restart VDSM) as a preliminary step before...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.3.0
Assignee: Martin Perina
QA Contact: Artyom
URL:
Whiteboard: infra
Duplicates: 590370
Depends On:
Blocks: 975301 978736 978737 1019470
 
Reported: 2013-05-26 16:53 UTC by Barak
Modified: 2016-02-10 19:37 UTC

Fixed In Version: is5
Doc Type: Enhancement
Doc Text:
This feature adds a new step to the flow of automatic fencing whereby non-responsive hosts are made responsive again faster and without having to perform real fencing. With this feature, VDSM is restarted using an SSH connection when a host is non-responsive, and real fencing is executed if this restart does not make the host responsive again. For more information, see http://www.ovirt.org/Automatic_Fencing#Automatic_Fencing_in_oVirt_3.3
Clone Of:
Clones: 975301
Environment:
Last Closed: 2014-01-21 17:23:02 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
engine.log (101.74 KB, text/x-log)
2013-07-09 07:47 UTC, Artyom


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2014:0038 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Virtualization Manager 3.3.0 update | 2014-01-21 22:03:06 UTC
oVirt gerrit | 15797 | 0 | None | None | None | Never
oVirt gerrit | 15798 | 0 | None | None | None | Never

Description Barak 2013-05-26 16:53:15 UTC
Currently the engine supports external fencing devices, which are used in two different use cases:
a. Storage/clustering related
b. Non-responsive host

This bug is about flow b (non-responsive host) and results from issues found and discussed in Bug 924801.

The problem:
A host becomes non-responsive due to an unexpected problem (one that causes VDSM to stop responding over time), but the VMs themselves continue to run and remain accessible.
A simple restart of VDSM can solve the issue.

Current status:
The host is fenced by the external fencing device and all the VMs die (some of them might be sensitive).

The solution:
Add a preliminary stage to the non-responsive fencing flow that tries to restart VDSM over SSH before doing the actual external fencing. If this preliminary stage fails, the flow should continue to external fencing. If the restart succeeds, the host should get additional grace time to recover; if it still has not recovered after that, the flow also ends in external fencing.
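
The proposed flow can be sketched roughly as follows. This is only an illustration of the decision logic described above, not the ovirt-engine implementation: the function name, the callback parameters and `grace_checks` are all hypothetical, and the branch for a host without a fencing device is an assumption, since that case is still listed as an open issue.

```python
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    RECOVERED = "recovered"
    FENCED = "fenced"
    STILL_NON_RESPONSIVE = "still_non_responsive"


def handle_non_responsive_host(
    restart_vdsm_over_ssh: Callable[[], bool],
    host_responds: Callable[[], bool],
    external_fence: Optional[Callable[[], None]],
    grace_checks: int = 3,
) -> Outcome:
    """Hypothetical sketch: soft fencing first, external fencing as fallback."""
    # Preliminary stage: try to restart VDSM over SSH.
    if restart_vdsm_over_ssh():
        # Restart succeeded: give the host extra grace time to recover.
        # A real implementation would wait between checks; this sketch
        # just polls a fixed number of times.
        for _ in range(grace_checks):
            if host_responds():
                return Outcome.RECOVERED

    # Restart failed, or the host did not recover within the grace time.
    if external_fence is None:
        # Assumption: without a fencing device nothing more is done.
        return Outcome.STILL_NON_RESPONSIVE
    external_fence()
    return Outcome.FENCED
```

Passing the SSH restart, the responsiveness check and the fencing action as callbacks keeps the sketch free of any real SSH or power management calls, so only the ordering of the stages is shown.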

Open Issues:
- What should we implement for hosts without an external fencing device? Should we add a host restart via SSH?

Comment 1 Itamar Heim 2013-05-26 18:26:00 UTC
I'd consider adding a 'try to soft fence via ssh first' checkbox to power management, or something like that.

Comment 2 Simon Grinberg 2013-06-03 08:06:17 UTC
*** Bug 590370 has been marked as a duplicate of this bug. ***

Comment 4 Artyom 2013-07-08 10:22:32 UTC
Open Issues:
- What should we implement for hosts without an external fencing device? Should we add a host restart via SSH?

What about this issue?

Comment 5 Martin Perina 2013-07-08 11:17:31 UTC
Please see the following URLs for how exactly SSH Soft Fencing is supposed to work in oVirt 3.3:

http://www.ovirt.org/Automatic_Fencing#Automatic_Fencing_in_oVirt_3.3
http://lists.ovirt.org/pipermail/engine-devel/2013-July/005080.html

Comment 6 Artyom 2013-07-09 07:47:52 UTC
Created attachment 770831
engine.log

Comment 7 Artyom 2013-07-09 07:48:36 UTC
Tested under:
rhevm - is4
host with power management - vdsm-4.10.2-19.0.el6ev
host without power management - vdsm-4.11.0-89.git8bf916a.el6
Result: no SSH fencing.
On the host without power management, after vdsmd was stopped manually, the host goes to the non-responsive state and there is no attempt to start vdsmd via SSH.
On the host with power management, the host is fenced, but there are no SSH commands in engine.log.
See the engine log for more details.

Comment 8 Martin Perina 2013-07-11 04:58:44 UTC
I've looked at the source code for RHEVM is5, and SSH Soft Fencing is included. Please retest.

Comment 9 Artyom 2013-07-14 12:17:26 UTC
Tested under:
rhevm - is5
host with power management - vdsm-4.11.0-121.git082925a.el6
host without power management - vdsm-4.11.0-121.git082925a.el6
Verified
On the host without power management, after vdsmd was stopped manually, the host stays in the connecting state for 1-2 minutes and then moves to the up state.

On the host with power management, sshSoftFencing is performed if SSH is enabled; otherwise the host is fenced.
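
The verified behaviour above can be summarised as a small decision function. This is a hedged sketch only: `first_recovery_action` and its string return values are hypothetical names, and the last branch (no SSH and no power management) is an assumption, since that combination was not part of the test.

```python
def first_recovery_action(ssh_enabled: bool, has_power_management: bool) -> str:
    """Hypothetical summary of which recovery step is tried first."""
    if ssh_enabled:
        # Observed on both hosts: vdsmd is restarted over SSH first.
        return "ssh_soft_fencing"
    if has_power_management:
        # Observed with power management and SSH disabled: regular fence.
        return "fence"
    # Assumption: nothing can be done automatically in this case.
    return "none"
```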

Comment 10 Charlie 2013-11-28 00:23:29 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 11 errata-xmlrpc 2014-01-21 17:23:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html

