Bug 1004945 - RHEV does not migrate VMs off a host that has been marked as non-operational
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64 Linux
Priority: high Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assigned To: Michal Skrivanek
Whiteboard: virt
Depends On:
Blocks:
Reported: 2013-09-05 15:43 EDT by Juan Manuel Santos
Modified: 2014-01-31 06:15 EST
CC List: 14 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-31 06:15:06 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments

  None
Description Juan Manuel Santos 2013-09-05 15:43:15 EDT
Description of problem:
RHEV does not migrate running VMs off a host that has been marked as non-operational after its connection to the storage (iSCSI) is severed. Only after multipath is stopped (and vdsm consequently restarts) are the VMs migrated off the host.

Version-Release number of selected component (if applicable):
3.2

How reproducible:
Every time

Steps to Reproduce:
1. Sever the connection to the storage (e.g. Blade servers with iSCSI storage: ifdown bond0; see the sketch after these steps)
2. RHEV marks the host as non-operational
3. VMs are marked as not responsive
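
A minimal reproduction sketch of the steps above, assuming (as in the reported setup) that bond0 carries only the iSCSI traffic; the interface name and the exact iscsi/multipath output are environment-specific:

# on the hypervisor: take down the bond that carries the iSCSI traffic
ifdown bond0

# the iSCSI sessions and multipath paths should now show as failed
iscsiadm -m session
multipath -ll

# in the Administration Portal the host then goes non-operational and its
# VMs are shown as not responding, but no automatic migration is started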

Actual results:
VMs are not migrated off the host unless multipath is forcibly stopped (and vdsm restarts).
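
For reference, a sketch of the workaround we used (RHEL 6 style service names assumed; multipathd and vdsmd may differ on other setups):

# forcibly stop multipath; as described above, this makes vdsm restart
service multipathd stop

# once vdsm has come back up, the engine starts migrating the VMs off the host
service vdsmd status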

Expected results:
VMs should migrate off the host automatically (and the host should be fenced eventually).

Additional info:
This has been reported on a PowerEdge M620 system with an iSCSI connection to the storage. Due to the blade form factor, severing the connection must be done with ifdown, since there is no cable to unplug directly on the host.
Comment 3 Itamar Heim 2013-09-06 01:08:34 EDT
doron/michal - I wonder whether this set of rules wouldn't become more dynamic if it were made part of the pluggable scheduler?
Comment 4 Michal Skrivanek 2013-09-06 03:06:27 EDT
I didn't look at this in detail yet, but:
if vdsm gets stuck, there is no way for us to migrate. The current engine logic should be all right, but it requires a working connection to the host.
Comment 5 Doron Fediuck 2013-09-08 20:44:34 EDT
(In reply to Michal Skrivanek from comment #4)
> I didn't look at this in detail yet, but:
> if vdsm gets stuck, there is no way for us to migrate. The current engine
> logic should be all right, but it requires a working connection to the host.

I agree. Once vdsm is stuck, any action may be problematic.
Comment 6 Saveliev Peter 2013-09-11 04:28:06 EDT
Dan, I would say that host fencing is the only possible choice here, but it should be chosen by the engine; vdsm has nothing to do with it.

Any ideas?
Comment 7 Dan Kenigsberg 2013-09-11 12:00:31 EDT
Yes, if Vdsm is stuck, it can do nothing. Only Engine (or a human admin) can choose to fence the host.

However, the real question is whether Vdsm was indeed stuck, and why. We've made great efforts to avoid cases where a blocked storage connection makes vdsm hang. If that happens, it is a storage bug.
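
For what it's worth, a quick way to check from the host whether vdsm itself is stuck, as a sketch (assuming the vdsm-cli package is installed so vdsClient is available, and that vdsm listens with SSL as in a default RHEV setup):

# does vdsm still answer on its local API? if these calls hang, vdsm is stuck
vdsClient -s 0 getVdsCaps
vdsClient -s 0 list

# processes blocked in uninterruptible I/O (D state) point at the kernel /
# storage layer rather than at vdsm itself
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'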
Comment 8 Saveliev Peter 2013-09-16 10:40:37 EDT
> Additional info:
> This has been reported on a PowerEdge M620 system with an iSCSI connection to
> the storage. Due to the blade form factor, severing the connection must be
> done with ifdown, since there is no cable to unplug directly on the host.

Btw, does that mean that only the storage connection was severed, or was the same NIC also used to communicate with the engine?
Comment 9 Juan Manuel Santos 2013-09-16 11:05:48 EDT
Only the storage connection was severed to test and reproduce this scenario. Communication with the engine was never severed (moreover, they are using separate NICs for each).
Comment 15 Michal Skrivanek 2014-01-31 06:15:06 EST
We can't do anything if the VM is stuck in the kernel due to storage issues. Otherwise, all the logic seems to work fine.
