Description of problem:
If a blade is physically disconnected from the chassis, VMs marked for HA are not restarted on other hypervisors.

Version-Release number of selected component (if applicable):
rhevm 3.3.0
vdsm 4.13.2-0.6

How reproducible:
Customer reports 100%

Steps to Reproduce:
1. Have VMs marked for HA running on a blade hypervisor
2. Physically disconnect that blade from the chassis while live

Actual results:
VMs are still marked as "Up" in RHEVM, but are obviously inaccessible (and are still reported as running on the hypervisor that was pulled from the chassis). Because of this, the VMs are not restarted on other hypervisors. Once the blade is re-connected, RHEV will then restart the HA VMs on other hypervisors.

Expected results:
HA VMs should automatically be restarted on other hypervisors once the blade is disconnected.

Additional info:
Customer was attempting to simulate several different types of outages or physical problems in order to test the migration and HA functions of RHEV when he pulled the blade from the chassis.
Was fencing configured, so that the engine could fence the blade and know that it is down?
Yes, fencing is configured.
Can we get the engine log file from the relevant time?
Ok, so what I see in the log is:

21:28 - Host rhev4-11 unplugged
21:28 - Engine detected a network failure
21:28 - Low disk space warning :)
21:29 - Fencing using Ssh
21:30 - Ssh times out; ipmi restart is invoked using rhev4-12 as proxy
21:30 - ipmi stop is invoked using rhev4-12 as proxy
21:30 - ipmi status reports Chassis power = Unknown due to timeout
21:31 - Primary PM Agent definitions are corrupted, Stop aborted
21:31 - Failed to verify Host rhev4-11 Restart status, Please Restart Host rhev4-11 manually
21:31 - VdsStatus set to NonResponsive
21:31 - Failed to verify host rhev4-11 stop status. Have retried 18 times with delay of 10 seconds between each retry.
21:31 - Failed to power fence host rhev4-11. Please check the host status and its power management settings, and then manually reboot it and click "Confirm Host Has Been Rebooted"
21:31 - Restart host action failed, updating host 816fc18a-afb5-4137-a5be-6db16a1d6845 (rhev4-11) to NonResponsive
21:36 - OnVdsDuringFailureTimer of vds rhev4-11 entered
21:38 - MigrateVm and MigrateVDS commands were issued
21:39 - MigrateVm and MigrateVDS commands were issued again
21:40 - Host plugged back in

It seems that the engine was slowly getting to the point where it would start the VMs again; it was first trying all the less aggressive options. On the other hand, 12 minutes might be too long, but I believe all the timeouts are configurable.
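For illustration, the "retried 18 times with delay of 10 seconds" message alone accounts for 18 x 10 s = 180 s spent just confirming the stop, before the reset attempts and other timeouts are added on top. A minimal sketch of what such a verification loop looks like (class, method, and interface names here are illustrative only, not the actual engine source):

    // Illustrative sketch of the verify-after-stop polling implied by the log:
    // 18 attempts x 10 s delay = 180 s before the engine gives up on fencing.
    public class FenceStatusPoller {
        private static final int RETRIES = 18;        // from the log message
        private static final long DELAY_MS = 10_000;  // 10-second delay per retry

        /** Polls until the agent reports the host is off, or retries run out. */
        public boolean waitForHostStopped(FenceAgent agent) throws InterruptedException {
            for (int attempt = 1; attempt <= RETRIES; attempt++) {
                if ("off".equals(agent.getPowerStatus())) {
                    return true;   // stop confirmed, HA VMs can be restarted elsewhere
                }
                Thread.sleep(DELAY_MS);
            }
            return false;          // status unknown -> host left NonResponsive
        }

        /** Hypothetical agent interface; the real engine API differs. */
        interface FenceAgent {
            String getPowerStatus();
        }
    }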
Martin, just to clarify: are you saying that the customer would need to adjust timeouts in their environment for this (i.e. NOTABUG)? Or that the defaults should be adjusted within the engine? I'd say 12 minutes is definitely too long, and as far as I'm aware, this customer has not adjusted any of the default settings.
Hi Jake, I have only gone through the logs so far and wrote the summary to save time for others who might be reading this bug; there is still some investigation going on.

To add some more data, the unplugged host was not the SPM, according to:

2014-02-14 21:27:22,289 INFO starting spm on vds rhev4-12

That is important, because if they had pulled out the SPM node, manual intervention would possibly have been required. I agree that 12 minutes is probably too long, though.
Hi Jake, there is a page [1] where you can find which values to tweak to make the timeout shorter (using the engine-config tool).

[1] http://www.ovirt.org/Automatic_Fencing
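For reference, a typical session with the engine-config tool looks like this, run on the engine host (the key names below are the ones discussed later in this thread; exact keys and availability may vary by version):

    # Query the current values before changing anything:
    engine-config -g VDSAttemptsToResetCount
    engine-config -g TimeoutToResetVdsInSeconds
    engine-config -g vdsTimeout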
add 2) - so that's why the engine is trying so hard to confirm the state of the blade, and that's where the lag comes from. I'm really curious how VMware handles such a scenario.
Jiri, the values changed were:

VDSAttemptsToResetCount=1 (down from 3)
TimeoutToResetVdsInSeconds=30 (down from 60)

We left vdsTimeout at 180 per http://www.ovirt.org/Sla/ha-timeouts. vdsConnectionTimeout was left at 2s, and vdsRetries was left at 0.

Will ask the customer to reset and do the longer test.
And please don't forget to restart the engine after changing the configuration.
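Putting the last two comments together, applying the shortened timeouts on a RHEV 3.3 engine host would look roughly like this (values taken from the comment above):

    # Shorten the fencing retry cycle:
    engine-config -s VDSAttemptsToResetCount=1
    engine-config -s TimeoutToResetVdsInSeconds=30
    # vdsTimeout, vdsConnectionTimeout and vdsRetries left at their defaults

    # Pick up the new values:
    service ovirt-engine restart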
Looking at this with Omer, we came to the conclusion that when the PM agent stop operation fails, we are not moving the VMs to UNKNOWN.

This should be fixed as follows: in RestartVdsCommand::executeCommand, in case the stop failed, it should perform handleError from VdsNotRespondingTreatmentCommand, which also clears the VMs and puts them in UNKNOWN.

As a result of our findings, putting the BZ on infra and taking the BZ. Will handle ASAP.
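To make the proposed change concrete, here is a rough sketch of the control flow described above. The class and method names are taken from this comment, but the body is illustrative only, not the actual engine source:

    // Rough sketch of the proposed fix: when the PM stop operation fails,
    // fall back to the non-responsive treatment so the HA VMs are moved to
    // UNKNOWN instead of being left "Up" on the dead host.
    public class RestartVdsCommand {
        protected void executeCommand() {
            boolean stopSucceeded = stopHost();   // hypothetical helper
            if (!stopSucceeded) {
                // Previously the command just aborted here, leaving the VMs "Up".
                // The fix: reuse handleError(), which clears the VMs from the
                // host and sets them to UNKNOWN, making them eligible for HA
                // restart on another hypervisor.
                handleError();
                return;
            }
            restartHost();                        // hypothetical helper
        }

        private boolean stopHost() { /* invoke the PM agent's stop */ return false; }
        private void restartHost() { /* invoke the PM agent's restart */ }
        protected void handleError() { /* move VMs to UNKNOWN, mark host */ }
    }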
Please note that in case the host was rebooted manually, the user should still select the host, right-click, and choose "Confirm that host has been rebooted" in order to get the HA VMs running on another host.
First, this should be tested on a non-SPM hypervisor.

Secondly, in case fencing fails, we cannot tell what the host status is, and the user must manually right-click and select "Confirm host has been rebooted".
(In reply to Eli Mesika from comment #21)
> First, this should be tested on a non-SPM hypervisor.
>
> Secondly, in case fencing fails, we cannot tell what the host status is, and
> the user must manually right-click and select "Confirm host has been
> rebooted".

Actually, it can be tested on the SPM ... but you'll need to reboot it and then mark the host as rebooted.
ok - av6.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0506.html