Bug 1004945

Summary: RHEV does not migrate VMs off a host that has been marked as non-operational
Product: Red Hat Enterprise Virtualization Manager
Reporter: Juan Manuel Santos <jsantos>
Component: vdsm
Assignee: Michal Skrivanek <michal.skrivanek>
Status: CLOSED CANTFIX
Severity: high
Priority: high
Version: 3.2.0
CC: acathrow, adevolder, bazulay, dallan, danken, dfediuck, iheim, jdenemar, jsantos, lpeer, michal.skrivanek, mprivozn, rhodain, yeylon
Target Release: 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: virt
Doc Type: Bug Fix
Last Closed: 2014-01-31 11:15:06 UTC
Type: Bug

Description Juan Manuel Santos 2013-09-05 19:43:15 UTC
Description of problem:
RHEV does not migrate running VMs off a host that has been marked as non-operational after its connection to the storage (iSCSI) is severed. Only after multipath is stopped (which restarts vdsm) are the VMs migrated off the host.

Version-Release number of selected component (if applicable):
3.2

How reproducible:
Every time

Steps to Reproduce:
1. Sever the connection to the storage (e.g. Blade servers with iSCSI storage: ifdown bond0)
2. RHEV marks the host as non-operational
3. VMs are marked as not responsive

Actual results:
VMs are not migrated off the host unless multipath is forcibly stopped (which restarts vdsm).

Expected results:
VMs should migrate off the host automatically (and the host should be fenced eventually).

Additional info:
This has been reported on a PowerEdge M620 system with an iSCSI connection to the storage. Because this is a blade system, severing the connection must be done with ifdown, since there is no cable to unplug directly on the host.
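The reproduction steps above can be sketched as a shell session. The interface name and the dry-run wrapper are illustrative assumptions: bond0 is taken to be the bond carrying only iSCSI traffic, with engine traffic on a separate NIC as stated in comment 9.

```shell
#!/bin/sh
# Illustrative sketch of the reproduction; only run with DRY_RUN=0 on a
# disposable host. IFACE and the run() wrapper are assumptions added
# here for safety, not part of the original report.
IFACE="${IFACE:-bond0}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Sever the storage connection only (the engine connection stays up).
run ifdown "$IFACE"

# 2. The engine now marks the host non-operational and the VMs
#    not responsive, but does not migrate them.

# 3. Per the report, only stopping multipathd (which restarts vdsm)
#    triggers the migration.
run service multipathd stop
```

With the default `DRY_RUN=1` the script only prints the commands it would execute, which makes it safe to review before running for real.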

Comment 3 Itamar Heim 2013-09-06 05:08:34 UTC
doron/michal - i wonder if this set of rules wouldn't be made more dynamic if it were made part of the pluggable scheduler?

Comment 4 Michal Skrivanek 2013-09-06 07:06:27 UTC
didn't look in detail yet, but:
if the vdsm gets stuck there's no way for us to migrate. The current engine logic should be all right, but it requires a working connection to the host

Comment 5 Doron Fediuck 2013-09-09 00:44:34 UTC
(In reply to Michal Skrivanek from comment #4)
> didn't look in detail yet, but:
> if the vdsm gets stuck there's no way for us to migrate. The current engine
> logic should be all right, but it requires a working connection to the host

I agree. Once vdsm is stuck any action may be problematic.

Comment 6 Saveliev Peter 2013-09-11 08:28:06 UTC
Dan, I would say that host fencing is the only possible choice here, but it should be chosen by the engine; vdsm has nothing to do with it.

Any ideas?

Comment 7 Dan Kenigsberg 2013-09-11 16:00:31 UTC
Yes, if Vdsm is stuck it can do nothing. Only Engine (or a human admin) can choose to fence the host.

However, the real question is whether Vdsm was indeed stuck, and why. We've made great efforts to avoid a case where a blocked storage connection makes vdsm hang. It is a storage bug if it happens.

Comment 8 Saveliev Peter 2013-09-16 14:40:37 UTC
> Additional info:
> This has been reported on a PowerEdge M620 system with iSCSI connection to the
> storage. Due to Blade nature, severing the connection must be done by ifdown
> since there's no cable to unplug directly on the host.

Btw, does this mean that only the storage connection was severed, or was the same NIC used to communicate with the engine?

Comment 9 Juan Manuel Santos 2013-09-16 15:05:48 UTC
Only the storage connection was severed to test and reproduce this scenario. Communication with the engine was never severed (moreover they are using separate NICs for each).

Comment 15 Michal Skrivanek 2014-01-31 11:15:06 UTC
we can't do anything if the VM is stuck in the kernel on storage issues. Otherwise all the logic seems to work fine