Description of problem:
If a blade is physically disconnected from the chassis, VMs marked for HA are not restarted on other hypervisors
Version-Release number of selected component (if applicable):
How reproducible:
Customer reports 100% reproducible
Steps to Reproduce:
1. Have VMs marked for HA running on a blade hypervisor
2. Physically disconnect that blade from the chassis while live
Actual results:
VMs are still marked as "Up" in RHEVM, but are obviously inaccessible (and are still reported as running on the hypervisor that was pulled from the chassis). Because of this, the VMs are not restarted on other hypervisors.
Once the blade is re-connected, RHEV will then restart the HA VMs on other hypervisors.
Expected results:
HA VMs should automatically be restarted on other hypervisors once the blade is disconnected.
Additional info:
The customer was attempting to simulate several different types of outages and physical failures in order to test the migration and HA functions of RHEV when he pulled the blade from the chassis.
Was fencing configured, so that the engine could fence the blade and know it is down?
Yes, fencing is configured.
Can we get the engine log file from the relevant time?
Ok so what I see in the log is:
21:28 - Host rhev4-11 unplugged
21:28 - Engine detected a network failure
21:28 - Low disk space warning :)
21:29 - Fencing using Ssh
21:30 - ssh timeouts, ipmi restart is invoked using rhev4-12 as proxy
21:30 - ipmi stop is invoked using rhev4-12 as proxy
21:30 - ipmi status reports Chassis power = Unknown due to timeout
21:31 - Primary PM Agent definitions are corrupted, Stop aborted
21:31 - Failed to verify Host rhev4-11 Restart status, Please Restart Host rhev4-11 manually
21:31 - VdsStatus set to NonResponsive
21:31 - Failed to verify host rhev4-11 stop status. Have retried 18 times with delay of 10 seconds between each retry.
21:31 - Failed to power fence host rhev4-11. Please check the host status and its power management settings, and then manually reboot it and click "Confirm Host Has Been Rebooted"
21:31 - Restart host action failed, updating host 816fc18a-afb5-4137-a5be-6db16a1d6845 (rhev4-11) to NonResponsive
21:36 - OnVdsDuringFailureTimer of vds rhev4-11 entered
21:38 - MigrateVm and MigrateVDS commands were issued
21:39 - MigrateVm and MigrateVDS commands were issued again
21:40 - Host plugged back
It seems that the engine was slowly getting to the point where it would start the VMs again; it was first trying all the less aggressive options.
On the other hand, 12 minutes might be too long, but I believe all the timeouts are configurable.
Just to clarify: are you saying that the customer would need to adjust the timeouts in their environment for this (i.e. NOTABUG)? Or that the defaults should be adjusted within the engine?
I'd say 12 minutes is definitely too long, and as far as I'm aware, this customer has not adjusted any of the default settings.
I have only gone through the logs so far and wrote the summary to save time for others who might be reading this bug; there is still some investigation going on.
To add some more data, the unplugged host was not SPM according to:
2014-02-14 21:27:22,289 INFO starting spm on vds rhev4-12
That is important, because if they pulled out the SPM node, manual intervention would possibly be required.
I agree that 12 minutes is probably too long though.
There is a page where you can find which values to tweak to make the timeout shorter (using the engine-config tool).
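As a sketch (assuming the stock engine-config tool on the engine machine; key names are the ones discussed in this bug), the current values can be inspected before changing anything:

```shell
# List the available configuration keys and their descriptions
engine-config --list

# Inspect the fencing/monitoring timeouts relevant to this bug
engine-config -g VDSAttemptsToResetCount
engine-config -g TimeoutToResetVdsInSeconds
engine-config -g vdsTimeout
engine-config -g vdsConnectionTimeout
engine-config -g vdsRetries
```

These commands are environment-specific and must be run on the RHEV-M/engine host.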
- so that's why the engine is trying so hard to confirm the state of the blade, and that's where the lag comes from.
I'm really curious how VMware handles such a scenario.
Values changed were:
VDSAttemptsToResetCount=1 (down from 3)
TimeoutToResetVdsInSeconds=30 (down from 60)
We left vdsTimeout at 180 per http://www.ovirt.org/Sla/ha-timeouts.
vdsConnectionTimeout was left at 2s, and vdsRetries was left at 0.
Will ask the customer to reset and do the longer test.
And please don't forget to restart the engine after changing the configuration.
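For reference, a sketch of applying the changes above with engine-config and then restarting the engine (this assumes a RHEV 3.x engine host; adjust the service restart command for your platform):

```shell
# Shorten the fence-verification retry loop (values used in this case)
engine-config -s VDSAttemptsToResetCount=1
engine-config -s TimeoutToResetVdsInSeconds=30

# Verify the new values before the next test
engine-config -g VDSAttemptsToResetCount
engine-config -g TimeoutToResetVdsInSeconds

# Restart the engine so the changes take effect
service ovirt-engine restart
```

vdsTimeout, vdsConnectionTimeout, and vdsRetries were deliberately left at their defaults per http://www.ovirt.org/Sla/ha-timeouts.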
Looking at this with Omer, we came to the conclusion that, in the case that the PM agent stop operation fails, we are not moving the VMs to Unknown.
This should be fixed as follows:
In RestartVdsCommand::executeCommand, in case the stop fails, it should perform handleError from VdsNotRespondingTreatmentCommand, which also clears the VMs and puts them in UNKNOWN.
As a result of our findings, putting the BZ on infra and taking the BZ.
Will handle ASAP
Please note that, in case the host was rebooted manually, the user should still select the host, right-click, and choose "Confirm that host has been rebooted" in order to get the HA VMs running on another host.
First, this should be tested on a non-SPM hypervisor.
Secondly, in the case that fencing fails, we cannot tell what the host status is, and the user must manually right-click and select "Confirm host has been rebooted".
(In reply to Eli Mesika from comment #21)
> First, this should be tested on a non SPM hyper-visor
> Secondly, In the case that fencing fails, we can not tell what is the host
> status and user must manually right-click plus select "Confirm host has been
Actually it can be tested on SPM ... but you'll need to reboot it and then mark the host as rebooted.
ok - av6.1
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.