Bug 974297

Summary: vm is not moved to "Unknown" when host becomes non-responsive and has no pm configured
Product: Red Hat Enterprise Virtualization Manager Reporter: Marina Kalinin <mkalinin>
Component: ovirt-engine Assignee: Omer Frenkel <ofrenkel>
Status: CLOSED DUPLICATE QA Contact:
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.2.0 CC: acathrow, bazulay, byount, dyasny, hchiramm, iheim, lpeer, michal.skrivanek, Rhev-m-bugs, sputhenp, yeylon, ykaul
Target Milestone: --- Keywords: Regression, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2013-06-16 10:58:43 UTC Type: Bug
Attachments:
Description Flags
Engine.log none

Description Marina Kalinin 2013-06-13 21:54:13 UTC
Description of problem:
A running VM is not moved to "Unknown" when its host becomes non-responsive.
Trying to open the VM's console fails with the error: "Error while executing action SetVmTicket: Network error during communication with the Host."

Maybe this is a new feature I am not aware of, but as far as I know, all VMs should move to the "Unknown" state once the host they are running on becomes non-responsive and cannot be fenced.

I tried the reproducer twice and the problem persists.

Version-Release number of selected component (if applicable):
rhevm-3.2.0-11.30.el6ev
vdsm-4.10.2-22.0.el6ev (on RHEL host)

How reproducible:
100%

Steps to Reproduce:
1. Start a VM on host A and mark it as highly available (HA).
(This is probably not important for the test, but it is what I did.)

2. SSH to host A and disconnect its network, with either
# ifdown ethX
or
# service network stop


Actual results:
Host becomes non-responsive. VM is still reported as up and running.

Expected results:
VM should be reported as Unknown.

Additional info:
After performing "confirm host was rebooted" on the non-responsive host, the VM was automatically started on another host in the cluster.

Comment 2 Marina Kalinin 2013-06-13 21:59:39 UTC
Created attachment 760989
Engine.log

Comment 4 Marina Kalinin 2013-06-13 22:44:58 UTC
2013-06-13 17:01:05,013 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-54) [da6e479] Server failed to respond, vds_id = 32571578-460a-11e2-a75e-00163e758d0e, vds_name = rhevh-4, vm_count = 1, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.NoRouteToHostException: No route to host
2013-06-13 17:01:05,023 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (pool-3-thread-49) [da6e479] ResourceManager::vdsNotResponding entered for Host 32571578-460a-11e2-a75e-00163e758d0e, rhevh-4.gsslab.rdu2.redhat.com
2013-06-13 17:01:05,039 WARN  [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] (pool-3-thread-49) [da6e479] CanDoAction of action VdsNotRespondingTreatment failed. Reasons:VDS_FENCE_DISABLED
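
For triage, the key = value fields in the first log line above (vds_name, vm_count, the non-responsive timeout, the error) can be pulled out with a small helper. This is only a sketch: the class name EngineLogFields and the regular expression are assumptions made for illustration and are not part of ovirt-engine.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical triage helper; not part of ovirt-engine.
public class EngineLogFields {
    // A key is a run of word characters or hyphens, optionally followed by
    // " (seconds)"; the value runs up to the next comma or the end of the line.
    private static final Pattern PAIR =
            Pattern.compile("([\\w-]+(?: \\(seconds\\))?) = ([^,]*)");

    // Returns the "key = value" pairs of one log line, in order of appearance.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "Server failed to respond, vds_id = 32571578-460a-11e2-a75e-00163e758d0e, "
                + "vds_name = rhevh-4, vm_count = 1, spm_status = None, "
                + "non-responsive_timeout (seconds) = 60, "
                + "error = java.net.NoRouteToHostException: No route to host";
        parse(line).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

The pattern assumes that values contain no commas, which holds for the line quoted here but may not hold for every engine.log message.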
--------
VdsNotRespondingTreatmentCommand inherits its CanDoAction from FenceVdsBaseCommand.java.
From what I see there:
~~~
    protected boolean canDoAction() {
        boolean retValue = false;
         ....
        if (getVds().getpm_enabled()
                && IsPowerManagementLegal(getVds().getStaticData(), getVdsGroup().getcompatibility_version().toString())) {
        ... // We do not take this branch, since fencing (power management) is not enabled.
            // HandleError(), which is actually supposed to move the VMs to Unknown,
            // is called only inside this branch:

            if (!retValue) {
                HandleError();
            }
        } else {
            // We go directly here with retValue still false and never reach HandleError().
            addCanDoActionMessage(VdcBllMessages.VDS_FENCE_DISABLED);
        }
        getReturnValue().setSucceeded(retValue);
        return retValue;
    }
~~~
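
To make the suspected control-flow problem concrete, here is a minimal, self-contained model of the snippet quoted above. Every name in it (FenceFlowSketch, Host, VmStatus, handleError, canDoActionAsQuoted, canDoActionRestructured) is hypothetical and only mimics the quoted structure; the restructured variant is one possible shape under that assumption, not the actual fix from the engine code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the quoted control flow.
// None of these types are the real ovirt-engine classes.
public class FenceFlowSketch {
    enum VmStatus { UP, UNKNOWN }

    static class Host {
        final boolean pmEnabled;
        final List<VmStatus> vms = new ArrayList<>();

        Host(boolean pmEnabled, int runningVms) {
            this.pmEnabled = pmEnabled;
            for (int i = 0; i < runningVms; i++) {
                vms.add(VmStatus.UP);
            }
        }
    }

    // Stand-in for HandleError(): mark every VM on the host as Unknown.
    static void handleError(Host host) {
        for (int i = 0; i < host.vms.size(); i++) {
            host.vms.set(i, VmStatus.UNKNOWN);
        }
    }

    // Mirrors the quoted structure: handleError() is reachable only
    // when power management is enabled.
    static boolean canDoActionAsQuoted(Host host) {
        boolean retValue = false;
        if (host.pmEnabled) {
            // Fencing checks elided; assume they fail, so retValue stays false.
            if (!retValue) {
                handleError(host);
            }
        } else {
            // The original only records a validation message here.
        }
        return retValue;
    }

    // One possible restructuring (an assumption, not the actual fix):
    // run the error handling whenever the action cannot proceed.
    static boolean canDoActionRestructured(Host host) {
        boolean retValue = false;
        if (host.pmEnabled) {
            // Fencing checks elided; assume they fail, so retValue stays false.
        }
        if (!retValue) {
            handleError(host);
        }
        return retValue;
    }

    public static void main(String[] args) {
        Host a = new Host(false, 1);
        canDoActionAsQuoted(a);
        System.out.println("as quoted:    " + a.vms);

        Host b = new Host(false, 1);
        canDoActionRestructured(b);
        System.out.println("restructured: " + b.vms);
    }
}
```

With power management disabled, the model that follows the quoted structure leaves the VM as UP, which matches the behaviour reported in this bug, while the restructured variant marks it UNKNOWN.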

Comment 6 Omer Frenkel 2013-06-16 10:58:43 UTC

*** This bug has been marked as a duplicate of bug 921521 ***