Description of problem:
If a blade is physically disconnected from the chassis, VMs marked for HA are not restarted on other hypervisors
Version-Release number of selected component (if applicable):
How reproducible:
Customer reports 100% reproducible
Steps to Reproduce:
1. Have VMs marked for HA running on a blade hypervisor
2. Physically disconnect that blade from the chassis while live
Actual results:
VMs are still marked as "Up" in RHEVM, but are obviously inaccessible (and are still reported as running on the hypervisor that was pulled from the chassis). Because of this, the VMs are not restarted on other hypervisors.
Once the blade is re-connected, RHEV will then restart the HA VMs on other hypervisors.
Expected results:
HA VMs should automatically be restarted on other hypervisors once the blade is disconnected.
Additional info:
The customer was attempting to simulate several different types of outages and physical failures in order to test the migration and HA functions of RHEV when he pulled the blade from the chassis.
Was fencing configured, so that the engine could fence the blade and know it is down?
Yes, fencing is configured.
Can we get the engine log file from the relevant time?
Ok so what I see in the log is:
21:28 - Host rhev4-11 unplugged
21:28 - Engine detected a network failure
21:28 - Low disk space warning :)
21:29 - Fencing using Ssh
21:30 - ssh timeouts, ipmi restart is invoked using rhev4-12 as proxy
21:30 - ipmi stop is invoked using rhev4-12 as proxy
21:30 - ipmi status reports Chassis power = Unknown due to timeout
21:31 - Primary PM Agent definitions are corrupted, Stop aborted
21:31 - Failed to verify Host rhev4-11 Restart status, Please Restart Host rhev4-11 manually
21:31 - VdsStatus set to NonResponsive
21:31 - Failed to verify host rhev4-11 stop status. Have retried 18 times with delay of 10 seconds between each retry.
21:31 - Failed to power fence host rhev4-11. Please check the host status and its power management settings, and then manually reboot it and click "Confirm Host Has Been Rebooted"
21:31 - Restart host action failed, updating host 816fc18a-afb5-4137-a5be-6db16a1d6845 (rhev4-11) to NonResponsive
21:36 - OnVdsDuringFailureTimer of vds rhev4-11 entered
21:38 - MigrateVm and MigrateVDS commands were issued
21:39 - MigrateVm and MigrateVDS commands were issued again
21:40 - Host plugged back
It seems that the engine was slowly getting to the point where it would start the VMs again; it was first trying all the less aggressive options.
On the other hand, 12 minutes might be too long, but I believe all the timeouts are configurable.
Just to clarify: are you saying that the customer would need to adjust the timeouts in their environment for this (i.e. NOTABUG)? Or that the defaults should be adjusted within the engine?
I'd say 12 minutes is definitely too long, and as far as I'm aware, this customer has not adjusted any of the default settings.
I have only gone through the logs so far and wrote the summary to save time for others who might be reading this bug; there is still some investigation going on.
To add some more data, the unplugged host was not SPM according to:
2014-02-14 21:27:22,289 INFO starting spm on vds rhev4-12
That is important, because if they pulled out the SPM node, manual intervention would possibly be required.
I agree that 12 minutes is probably too long though.
There is a page where you can find which values to tweak to make the timeout shorter (using the engine-config tool).
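As a sketch (assuming the stock engine-config tool on the engine machine; key names are the ones discussed in this bug), the current values can be inspected before changing anything:

```shell
# List the available configuration keys and their descriptions
engine-config --list

# Inspect the fencing/monitoring timeouts relevant to this bug
engine-config -g VDSAttemptsToResetCount
engine-config -g TimeoutToResetVdsInSeconds
engine-config -g vdsTimeout
engine-config -g vdsConnectionTimeout
engine-config -g vdsRetries
```

These commands are environment-specific and must be run on the RHEV-M/engine host.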
- so that's why the engine is trying so hard to confirm the state of the blade, and that's where the lag comes from.
I'm really curious how VMware handles such a scenario.
Values changed were:
VDSAttemptsToResetCount=1 (down from 3)
TimeoutToResetVdsInSeconds=30 (down from 60)
We left vdsTimeout at 180 per http://www.ovirt.org/Sla/ha-timeouts.
vdsConnectionTimeout was left at 2s, and vdsRetries was left at 0.
Will ask the customer to reset and do the longer test.
And please don't forget to restart the engine after changing the configuration.
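For reference, a sketch of applying the changes above with engine-config and then restarting the engine (this assumes a RHEV 3.x engine host; adjust the service restart command for your platform):

```shell
# Shorten the fence-verification retry loop (values used in this case)
engine-config -s VDSAttemptsToResetCount=1
engine-config -s TimeoutToResetVdsInSeconds=30

# Verify the new values before the next test
engine-config -g VDSAttemptsToResetCount
engine-config -g TimeoutToResetVdsInSeconds

# Restart the engine so the changes take effect
service ovirt-engine restart
```

vdsTimeout, vdsConnectionTimeout, and vdsRetries were deliberately left at their defaults per http://www.ovirt.org/Sla/ha-timeouts.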
Looking at this with Omer, we came to the conclusion that, in the case that the PM agent stop operation fails, we are not moving the VMs to Unknown.
This should be fixed as follows:
In RestartVdsCommand::executeCommand, in case the stop fails, it should perform handleError from VdsNotRespondingTreatmentCommand, which also clears the VMs and puts them in UNKNOWN.
As a result of our findings, putting the BZ on infra and taking the BZ.
Will handle ASAP
Please note that, in case the host was rebooted manually, the user should still select the host, right-click, and choose "Confirm that host has been rebooted" in order to get the HA VMs running on another host.
First, this should be tested on a non-SPM hypervisor.
Secondly, in the case that fencing fails, we cannot tell what the host status is, and the user must manually right-click and select "Confirm host has been rebooted".
(In reply to Eli Mesika from comment #21)
> First, this should be tested on a non SPM hyper-visor
> Secondly, In the case that fencing fails, we can not tell what is the host
> status and user must manually right-click plus select "Confirm host has been
Actually it can be tested on SPM ... but you'll need to reboot it and then mark the host as rebooted.
ok - av6.1
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.