Description of problem: Sometimes we encounter intermittent failures which cause RHEV hosts to become non-responsive and get fenced. Usually the various logs provide enough information to work out what went wrong, but occasionally debugging requires information to be gathered during the broken state. This can be very difficult to obtain when the failures can occur at any time and business needs require bringing the failed host back online as soon as possible. To address this, it would be useful to have a RHEV debugging feature that runs a specified hook script before fencing a host. This script could gather diagnostic information relevant to the problem at hand before the failed host is fenced and restarted.
How would this diagnostic differ from running, say, log collector on the host? BTW, we already SSH to the host before fencing it to try restarting vdsm first.
Running sosreport / log collector on the host is a common thing I'd envisage the script doing. Other possibilities include:
1) Taking a partial or complete RHEV-M database snapshot (to capture the state at the time of the failure, rather than afterwards)
2) Invoking SysRq-t on the host to work out where processes are stuck
3) Forcing a kdump on the host to obtain a vmcore
It may also be useful in some cases to capture specific relevant information instead of a full sosreport, so that the host can be restarted sooner. Running this script before attempting a vdsm restart via SSH may make more sense than doing it just before fencing. Both variants could be useful.
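To make the idea concrete, a hook along these lines could start by grabbing the cheap, volatile state (thread-level process listing, memory state) before the heavier steps like sosreport or kdump. A minimal sketch in Python; the function name and output layout are invented for illustration, as RHEV ships no such hook point today:

```python
# Sketch of a pre-fence diagnostic hook -- hypothetical; RHEV does not
# provide this interface. Captures volatile state that is lost once the
# host reboots; heavier steps (sosreport, kdump) would follow.
import subprocess
import tempfile
from pathlib import Path

def collect_runtime_state(outdir=None):
    """Save a thread-level process listing and memory state to outdir."""
    out = Path(outdir) if outdir else Path(tempfile.mkdtemp(prefix="pre-fence-"))
    out.mkdir(parents=True, exist_ok=True)
    # ps -eLf lists every thread, useful for spotting stuck tasks.
    ps = subprocess.run(["ps", "-eLf"], capture_output=True, text=True)
    (out / "ps.txt").write_text(ps.stdout)
    # /proc/meminfo as it stood at the moment of the fault.
    (out / "meminfo.txt").write_text(Path("/proc/meminfo").read_text())
    return out
```

A real hook would run this on the failed host (e.g. over SSH) rather than locally, but the collection logic is the same.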
Now that we have soft fencing (vdsm restart only) as a first step, with hard fencing (reboot) only if that fails, I understand these hooks should run after the soft-fencing stage has failed, right?
Eli, for the case I've been looking at in relation to this, yes, running after soft fencing has failed would be correct. But I can think of scenarios where we'd want to run the diagnostics before attempting to restart vdsm, so both possibilities could be useful.
Such an action may postpone fencing for a long time (depending on what the script actually does). This is not predictable and may lead to many corner cases that affect policy and QoS. E.g. an HA VM on a host targeted for fencing would end up less available, even though an HA VM is supposed to be up all the time. Anyway, this will not be done for 3.4; moving to rhevm-future. Arthur?
I'm aware that this will postpone fencing and interfere with HA guarantees. This is intended as a diagnostic feature. Having your machine keep falling over without information to tell you why will also mess up your HA guarantees.
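One way the engine could bound that delay: run the hook under a hard timeout and proceed with fencing regardless of the hook's outcome. A sketch with an invented run_pre_fence_hook name; nothing here is an existing RHEV API:

```python
# Sketch only: RHEV has no such hook interface; names are invented.
# The point is that diagnostics are best-effort and must never block
# fencing indefinitely.
import subprocess

def run_pre_fence_hook(script, host, timeout=300):
    """Run a user-supplied diagnostic hook, bounded by a timeout."""
    try:
        subprocess.run([script, host], timeout=timeout, check=False)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        pass  # fencing proceeds whether or not the hook succeeded
```

With a default like 300 seconds, the worst-case extra downtime for an HA VM is known in advance and could itself be a cluster-level setting.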
Barak, agreed there might be unexpected implications with custom hooks running before fencing a node; maybe we should have a commented-out warning in the hook directory?
This RFE is unclear in a few ways:
- It postpones the fencing operation to an uncertain point in time, which affects policy.
- Having the host non-responsive (even after the SSH vdsm restart) means that in most cases (I was careful and said "most", but it could be all, as I haven't seen such a case) the host can't be accessed. How would you envision the collection from that host being done?
- What information are you missing? I need at least an example of the kind of missing data that cannot be obtained from the standard logs on the host (vdsm, libvirt, syslog).
> postponing the fencing operation to uncertain point in time - effects policy

Obviously. The idea is that this can be used when something is going wrong and capturing the diagnostic information is more important than the regular policies about downtime.

> Having the host in non-responsive (even after SSH vdsm restart) means that on most of the cases it means that the host can't be accessed, then how would you envision the collection from that host to be done?

I've worked on a bunch of cases where the host was marked Non Responding even though it was still accessible via SSH. In addition, a script could use ipmitool or similar to access the host console. Finally, in some cases it's useful to capture database information from the time of the failure, before that information is overwritten by the state of the rebooted host; that doesn't require accessing the host at all.

> What information are you missing?

Depends on the case, but possibilities include:
* Database snapshot information from the time of the problem
* Detailed ps output from the time of the problem
* Other runtime status information from the time of the fault, e.g. /proc/NN/stack, /proc/meminfo
* Cores (via gcore) from vdsm, libvirtd, qemu or other processes
* Triggering a vmcore / crash dump
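Of the items above, the database snapshot is the one that needs no host access at all: a pg_dump of the engine database taken before the rebooted host overwrites its state. A sketch that only builds the command line; "engine" is the default RHEV-M database name, and the output path is illustrative:

```python
# Sketch: build a pg_dump command for a point-in-time engine DB snapshot.
# "engine" is the default RHEV-M database name; the path is arbitrary.
def db_snapshot_cmd(dbname="engine", outfile="/var/tmp/engine-prefence.dump"):
    # -Fc: PostgreSQL custom archive format, restorable with pg_restore
    return ["pg_dump", "-Fc", "-f", outfile, dbname]
```

A hook could run this on the engine machine the moment the host goes Non Responding, independently of whether the host itself is reachable.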
Closing old issues. If still relevant, please provide the use case and re-open.