Bug 974297

Summary:

vm is not moved to "Unknown" when host becomes non-responsive and has no pm configured

Product:

Red Hat Enterprise Virtualization Manager

Reporter:

Marina Kalinin <mkalinin>

Component:

ovirt-engine

Assignee:

Omer Frenkel <ofrenkel>

Status:

CLOSED DUPLICATE

QA Contact:

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

3.2.0

CC:

acathrow, bazulay, byount, dyasny, hchiramm, iheim, lpeer, michal.skrivanek, Rhev-m-bugs, sputhenp, yeylon, ykaul

Target Milestone:

---

Keywords:

Regression, Triaged

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

virt

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2013-06-16 10:58:43 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Engine.log	none

Description Marina Kalinin 2013-06-13 21:54:13 UTC

Description of problem:
A running VM is not moved to "Unknown" when its host becomes non-responsive.
Trying to open the VM's console fails with the error: "Error while executing action SetVmTicket: Network error during communication with the Host."

Maybe this is a new feature I am not aware of, but as far as I know, all VMs should move to the "Unknown" state once the host they are running on becomes non-responsive and cannot be fenced.

I ran the reproducer twice and the problem occurred both times.

Version-Release number of selected component (if applicable):
rhevm-3.2.0-11.30.el6ev
vdsm-4.10.2-22.0.el6ev (on RHEL host)

How reproducible:
100%

Steps to Reproduce:
1. Start a VM on host A and mark it as HA
(not essential for the test, but that is what I did).

2. SSH to host A and take its network down, either with
# ifdown ethX
or
# service network stop


Actual results:
The host becomes non-responsive. The VM is still reported as up and running.

Expected results:
The VM should be reported as "Unknown".

Additional info:
After performing "confirm host was rebooted" on the non-responsive host, the VM was automatically started on another host in the cluster.

Comment 2 Marina Kalinin 2013-06-13 21:59:39 UTC

Created attachment 760989 [details]
Engine.log

Comment 4 Marina Kalinin 2013-06-13 22:44:58 UTC

2013-06-13 17:01:05,013 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-54) [da6e479] Server failed to respond, vds_id = 32571578-460a-11e2-a75e-00163e758d0e, vds_name = rhevh-4, vm_count = 1, spm_status = None, non-responsive_timeout (seconds) = 60, error = java.net.NoRouteToHostException: No route to host
2013-06-13 17:01:05,023 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (pool-3-thread-49) [da6e479] ResourceManager::vdsNotResponding entered for Host 32571578-460a-11e2-a75e-00163e758d0e, rhevh-4.gsslab.rdu2.redhat.com
2013-06-13 17:01:05,039 WARN  [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] (pool-3-thread-49) [da6e479] CanDoAction of action VdsNotRespondingTreatment failed. Reasons:VDS_FENCE_DISABLED
--------
So the CanDoAction used by VdsNotRespondingTreatmentCommand is implemented in FenceVdsBaseCommand.java.
And from what I see there:
~~~
    protected boolean canDoAction() {
        boolean retValue = false;
        ....
        if (getVds().getpm_enabled()
                && IsPowerManagementLegal(getVds().getStaticData(), getVdsGroup().getcompatibility_version().toString())) {
            ... // We never enter this branch, because fencing is not enabled on the host.
            // HandleError(), which is supposed to move the VMs to Unknown, is only reached inside this branch:
            if (!retValue) {
                HandleError();
            }
        } else {
            // Instead we land here with retValue still false, so HandleError() is never called.
            addCanDoActionMessage(VdcBllMessages.VDS_FENCING_DISABLED);
        }
        getReturnValue().setSucceeded(retValue);
        return retValue;
    }
~~~
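
To make the suspected control flow easier to check, here is a minimal standalone sketch of it. This is not the ovirt-engine code: the class `FenceFlowSketch`, the `Result` holder, and the three boolean parameters are hypothetical stand-ins for the elided parts of the excerpt above, and the only thing it models is whether the error-handling step is reached.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the branch structure in the excerpt above.
// None of these names come from ovirt-engine.
public class FenceFlowSketch {

    // Records what a single canDoAction-style call did.
    static class Result {
        final boolean canDo;
        final boolean handleErrorCalled;
        final List<String> messages;

        Result(boolean canDo, boolean handleErrorCalled, List<String> messages) {
            this.canDo = canDo;
            this.handleErrorCalled = handleErrorCalled;
            this.messages = messages;
        }
    }

    // pmEnabled / pmLegal stand in for the two conditions of the outer "if";
    // fenceChecksPass stands in for the elided checks inside that branch.
    static Result canDoAction(boolean pmEnabled, boolean pmLegal, boolean fenceChecksPass) {
        boolean retValue = false;
        boolean handleErrorCalled = false;
        List<String> messages = new ArrayList<>();

        if (pmEnabled && pmLegal) {
            retValue = fenceChecksPass;
            if (!retValue) {
                // The step that is expected to move the VMs to Unknown.
                handleErrorCalled = true;
            }
        } else {
            // No power management: only a message is added, no error handling.
            messages.add("VDS_FENCING_DISABLED");
        }
        return new Result(retValue, handleErrorCalled, messages);
    }

    public static void main(String[] args) {
        Result noPm = canDoAction(false, false, false);
        System.out.println("pm disabled: handleErrorCalled=" + noPm.handleErrorCalled);

        Result pmFailed = canDoAction(true, true, false);
        System.out.println("pm enabled, checks fail: handleErrorCalled=" + pmFailed.handleErrorCalled);
    }
}
```

In this model, a host without power management returns false without ever reaching the error-handling step, which matches the VDS_FENCE_DISABLED warning in the log and the VM staying Up.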

Comment 6 Omer Frenkel 2013-06-16 10:58:43 UTC


*** This bug has been marked as a duplicate of bug 921521 ***