Description of problem:
RHEV does not migrate running VMs off a host that has been marked as non-operational after its connection to the storage (iSCSI) is severed. Only after multipath is stopped (and vdsm restarts as a result) are the VMs migrated off the host.

Version-Release number of selected component (if applicable):
3.2

How reproducible:
Every time

Steps to Reproduce:
1. Sever the connection to the storage (e.g. blade servers with iSCSI storage: ifdown bond0)
2. RHEV marks the host as non-operational
3. VMs are marked as not responsive

Actual results:
VMs are not migrated off the host unless multipath is forcibly stopped (causing vdsm to restart).

Expected results:
VMs should migrate off the host automatically (and the host should eventually be fenced).

Additional info:
This has been reported on a PowerEdge M620 system with an iSCSI connection to the storage. Because it is a blade, the connection must be severed with ifdown, since there is no cable to unplug directly on the host.
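The reproduction steps above can be sketched as a small script. This is a minimal sketch under the report's assumptions: bond0 is the storage-only interface and the engine/management network uses a separate NIC, so the host stays reachable while storage is cut. It dry-runs by default (set RUN=1 on a disposable test host to actually execute).

```shell
#!/bin/sh
# Sketch of "Steps to Reproduce": sever only the iSCSI storage path.
# Assumption: $IFACE carries storage traffic only; the engine talks
# to the host over a different NIC, so step 1 does not cut management.

IFACE="${IFACE:-bond0}"   # storage-facing interface (assumed name)

# Print the command instead of running it unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

# Step 1: sever the storage connection (blades have no cable to pull).
run ifdown "$IFACE"

# Steps 2-3 are observations, not actions: the engine should mark the
# host non-operational and the VMs "not responsive". On the host,
# multipath should now report failed/faulty paths:
run multipath -ll
```

In dry-run mode the script only prints the two commands it would execute, which makes it safe to review before pointing it at a real host.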
doron/michal - i wonder if this set of rules wouldn't be made more dynamic if made part of the pluggable scheduler?
didn't look in detail yet, but: if vdsm gets stuck there's no way for us to migrate. The current engine logic should be all right, but it requires a working connection to the host.
(In reply to Michal Skrivanek from comment #4)
> didn't look in detail yet, but:
> if the vdsm gets stuck there's no way for us to migrate. The current engine
> logic should be all right, but it requires a working connection to the host

I agree. Once vdsm is stuck, any action may be problematic.
Dan, I would say that host fencing is the only possible choice here, but it should be chosen by the engine; vdsm has nothing to do with it. Any ideas?
Yes, if Vdsm is stuck it can do nothing. Only the Engine (or a human admin) can choose to fence the host. However, the real question is whether Vdsm was indeed stuck, and why. We've made great efforts to avoid a case where a blocked storage connection makes vdsm hang. If that happens, it is a storage bug.
> Additional info:
> This has been reported on a PowerEdge M620 system with iSCSI connection to the
> storage. Due to Blade nature, severing the connection must be done by ifdown
> since there's no cable to unplug directly on the host.

Btw, does that mean only the storage connection was severed, or was the same NIC also used to communicate with the engine?
Only the storage connection was severed to test and reproduce this scenario. Communication with the engine was never severed (moreover they are using separate NICs for each).
We can't do anything if the VM is stuck in the kernel on storage issues. Otherwise, all the logic seems to work fine.