Bug 1021744 - [RFE] Add hooks to power management to automatically run diagnostic scripts before fencing
Summary: [RFE] Add hooks to power management to automatically run diagnostic scripts before fencing
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
: ---
Assignee: Eli Mesika
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-10-22 00:55 UTC by David Gibson
Modified: 2019-05-16 13:08 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-24 13:50:03 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:
sherold: Triaged+



Description David Gibson 2013-10-22 00:55:52 UTC
Description of problem:

Sometimes we encounter intermittent failures which cause RHEV hosts to become non-responsive and get fenced.  Usually the various logs provide enough information to work out what went wrong, but occasionally debugging requires information gathered while the host is still in the broken state.

That information can be very difficult to obtain when the failures can occur at any time and business needs require bringing the failed host back online as soon as possible.

To address this, it would be good to have a RHEV debugging feature that runs a specified hook script before fencing a host.  The script could gather diagnostic information relevant to the problem at hand before the failed host is fenced and restarted.
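
To make the idea concrete, here is a purely illustrative sketch; no such hook interface exists today, so the argument convention, paths and commands are all assumptions. It imagines the engine passing the non-responsive host's address as the first argument, with root ssh to the host still working:

#!/usr/bin/env python
# Hypothetical pre-fence diagnostic hook -- a sketch only, since no such
# hook mechanism exists in RHEV.  Assumes the engine would pass the
# non-responsive host's address as argv[1] and that root ssh still works.
import subprocess
import sys
from datetime import datetime
from pathlib import Path

def run_on_host(host, command, timeout=300):
    """Run a command on the failing host over ssh and capture its output."""
    return subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", "root@" + host, command],
        capture_output=True, text=True, timeout=timeout,
    )

def main():
    host = sys.argv[1]
    outdir = Path("/var/log/pre-fence-diagnostics") / (
        host + "-" + datetime.utcnow().strftime("%Y%m%d-%H%M%S"))
    outdir.mkdir(parents=True, exist_ok=True)

    # Cheap state first, in case the slower collection never finishes.
    for name, cmd in [("ps.txt", "ps -eLf"),
                      ("meminfo.txt", "cat /proc/meminfo"),
                      ("vdsm-status.txt", "systemctl status vdsmd")]:
        result = run_on_host(host, cmd)
        (outdir / name).write_text(result.stdout + result.stderr)

    # Full sosreport last; this can take several minutes.
    run_on_host(host, "sosreport --batch --tmp-dir /var/tmp", timeout=1800)

if __name__ == "__main__":
    main()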

Comment 2 Itamar Heim 2013-10-22 05:02:00 UTC
How would the diagnostics be different from running, say, log collector on the host?

btw, we already ssh to the host before fencing it to try and restart vdsm first

Comment 3 David Gibson 2013-10-22 05:49:39 UTC
Running sosreport / log collector on the host is the most common thing I'd envisage the script doing.  Other possibilities include:
  1) Taking a partial or complete rhev-m database snapshot (to capture the state at the time of the failure, rather than afterwards)
  2) Invoking SysRq-t on the host to work out where processes are stuck
  3) Forcing a kdump on the host to obtain a vmcore

It may also be useful in some cases to capture specific relevant information instead of a full sosreport, in order to allow the host to be restarted sooner.

Running this script before attempting a vdsm restart via ssh may make more sense than doing it just before fencing.  Both variants could possibly be useful.
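
As a rough illustration of items 2) and 3) above, a script could simply poke /proc/sysrq-trigger over ssh, assuming the host still answers and that kdump is configured; the host name and root ssh access below are made up for the example:

# Sketch of items 2) and 3): poke SysRq on a host that still answers ssh.
# The host name and root ssh access are assumptions; SysRq-c deliberately
# crashes the kernel, so it only makes sense when kdump is configured and
# the host is about to be fenced anyway.
import subprocess

def sysrq(host, key):
    # /proc/sysrq-trigger takes a single SysRq key, e.g. "t" or "c".
    subprocess.run(
        ["ssh", "root@" + host, "echo {} > /proc/sysrq-trigger".format(key)],
        check=False, timeout=30,
    )

sysrq("failing-host.example.com", "t")    # dump every task's state to the host's dmesg/console
# sysrq("failing-host.example.com", "c")  # force a panic so kdump writes a vmcore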

Comment 4 Eli Mesika 2013-11-06 11:26:11 UTC
Now that we have soft-fencing (a vdsm restart only) as a first step, with hard fencing (reboot) only if that fails, I understand these hooks should run after the soft-fencing stage has failed, right?

Comment 5 David Gibson 2013-11-07 06:19:18 UTC
Eli,

For the case I've been looking at in relation to this, yes, running after soft-fencing failed would be correct.  But I can think of scenarios where we'd want to run the diagnostics before attempting to restart vdsm.  So, both possibilities could be useful.

Comment 6 Barak 2013-11-27 14:18:01 UTC
Such an action may postpone the fencing for a long time (depending on what the script actually does).
This is not predictable and may lead to many corner cases that affect policy and QoS.  For example, an HA VM on the host targeted for fencing will end up less available, even though HA means it should be up all the time.

Anyway, this will not be done for 3.4; moving to rhevm-future.


Arthur ?

Comment 7 David Gibson 2013-11-27 22:50:42 UTC
I'm aware that this will postpone fencing and interfere with HA guarantees.  This is intended as a diagnostic feature.  Having your machine keep falling over without information to tell you why will also mess up your HA guarantees.

Comment 8 Arthur Berezin 2013-12-14 20:57:37 UTC
Barak, agreed there might be unexpected implications with custom hooks before fencing a node; maybe we should have a commented-out warning in the hook directory?

Comment 9 Barak 2014-06-18 11:02:18 UTC
This RFE is unclear in a few ways:
- Postponing the fencing operation to an uncertain point in time affects policy.
- Having the host non-responsive (even after the SSH vdsm restart) means that in most cases (I was careful and said most, but it could be all, as I haven't seen such a case) the host can't be accessed; how would you envision the collection from that host being done?
- What information are you missing? I need at least an example of the kind of missing data that cannot be obtained from the standard logs on the host (vdsm, libvirt, syslog).

Comment 10 David Gibson 2014-06-19 02:05:02 UTC
- Postponing the fencing operation to an uncertain point in time affects policy.

Obviously.  The idea is that this would be used when something is going wrong and capturing the diagnostic information is more important than the regular policies about downtime.

- Having the host non-responsive (even after the SSH vdsm restart) means that in most cases (I was careful and said most, but it could be all, as I haven't seen such a case) the host can't be accessed; how would you envision the collection from that host being done?

I've worked on a bunch of cases where the host was marked Non-Responding even though it was still accessible via ssh.  In addition, a script could use ipmitool or similar to access the host console.

Finally, in some cases it's useful to capture database information from the time of the failure, before that information is overwritten by the state of the rebooted host.  That doesn't require accessing the host at all.
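
As a rough sketch of that engine-side capture (the database name and user "engine" and the output path are guesses at a default install, not a documented interface):

# Sketch of capturing the engine database before the host is fenced; this
# runs on the RHEV-M machine and does not touch the failed host at all.
# The database name/user ("engine") and the output path are assumptions
# about a default install -- adjust to the local setup.
import subprocess
from datetime import datetime

def snapshot_engine_db():
    outfile = "/var/tmp/engine-db-{}.dump".format(
        datetime.utcnow().strftime("%Y%m%d-%H%M%S"))
    subprocess.run(
        ["pg_dump", "-U", "engine", "-F", "c", "-f", outfile, "engine"],
        check=True,
    )
    return outfile

print(snapshot_engine_db())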

- What information are you missing? I need at least an example of the kind of missing data that cannot be obtained from the standard logs on the host (vdsm, libvirt, syslog).

Depends on the case, but possibilities include:
  * Database snapshot information from the time of the problem
  * Detailed ps outputs from the time of the problem
  * Other runtime status information from the time of the fault; e.g. /proc/NN/stack, /proc/meminfo
  * Cores (via gcore) from vdsm, libvirtd, qemu or other processes
  * Triggering a vmcore / crash dump
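
For illustration, a script could grab the per-process items above from a host that still answers ssh along these lines; the host name, root access and the use of pgrep/gcore are assumptions for the example:

# Rough sketch of grabbing the per-process items above over ssh from a
# host that still answers.  The host name, root access and use of
# pgrep/gcore are assumptions for illustration only.
import subprocess

HOST = "failing-host.example.com"   # hypothetical non-responsive host

def collect(command, outfile):
    result = subprocess.run(
        ["ssh", "root@" + HOST, command],
        capture_output=True, text=True, timeout=120,
    )
    with open(outfile, "w") as f:
        f.write(result.stdout + result.stderr)

collect("ps -eLf", "ps.txt")
# Kernel stack of every vdsm thread, to see where it is stuck.
collect("for p in $(pgrep -f vdsm); do echo == $p; cat /proc/$p/stack; done",
        "vdsm-stacks.txt")
# Core of the running vdsm process via gdb's gcore (left on the host in /var/tmp).
collect("gcore -o /var/tmp/vdsm-core $(pgrep -of vdsm)", "gcore.log")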

Comment 11 Doron Fediuck 2018-06-24 13:50:03 UTC
Closing old issues.
If still relevant, please provide the use case and re-open.

Comment 12 Franta Kust 2019-05-16 13:08:51 UTC
BZ<2>Jira Resync

