Bug 1314377

Summary: Confusing message when the HE VM cannot be migrated to "failing" HE host
Product: [oVirt] ovirt-engine Reporter: Jiri Belka <jbelka>
Component: Backend.Core    Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE QA Contact: Nikolai Sednev <nsednev>
Severity: medium Docs Contact:
Priority: low    
Version: 3.6.3.3    CC: bugs, dfediuck, jbelka, mavital
Target Milestone: ovirt-4.2.0    Keywords: Triaged
Target Release: 4.2.0    Flags: rule-engine: ovirt-4.2+
rule-engine: planning_ack+
rgolan: devel_ack+
mavital: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-20 11:01:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1512534    
Bug Blocks:    

Description Jiri Belka 2016-03-03 13:22:36 UTC
Description of problem:

I cannot migrate our HE VM to another host that is part of the HE setup; the message in the web UI is confusing:

~~~
Operation Canceled
Error while executing action:

HostedEngine:

    Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:
    The host _slot-5b did not satisfy internal filter HA because it is not a Hosted Engine host..
~~~

It seems the problem is related to the "score" of the host:

~~~
...
2016-03-03 14:10:40,374 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (ajp-/127.0.0.1:8702-15) [648b618b] hasPhysMemoryToRunVM: host '_slot-5b'; free memory is : 25104.0 MB (+ 0 MB pending); free swap is: 2047 MB, required memory is 16449.0 MB; Guest overhead 65 MB
2016-03-03 14:10:40,374 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (ajp-/127.0.0.1:8702-15) [648b618b] 16449.0 <= ???  27151.0
2016-03-03 14:10:40,374 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (ajp-/127.0.0.1:8702-15) [648b618b] hasOvercommitMemoryToRunVM: host '_slot-5b'; max scheduling memory : 40500.0 MB; required memory is 16449.0 MB; Guest overhead 65 MB
2016-03-03 14:10:40,375 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (ajp-/127.0.0.1:8702-15) [648b618b] 16449.0 <= ???  40500.0
2016-03-03 14:10:40,377 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.HostedEngineHAClusterFilterPolicyUnit] (ajp-/127.0.0.1:8702-15) [648b618b] Host '_slot-5b' was filtered out as it doesn't have a positive score (the score is 0)
2016-03-03 14:10:40,377 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (ajp-/127.0.0.1:8702-15) [648b618b] Candidate host '_slot-5b' ('229dfe21-9c3b-4ec3-8ffe-192ed87a408e') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null)
2016-03-03 14:10:40,377 WARN  [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (ajp-/127.0.0.1:8702-15) [648b618b] CanDoAction of action 'MigrateVmToServer' failed for user admin@internal. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName _slot-5b,$filterName HA,VAR__DETAIL__NOT_HE_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
...
~~~

The destination host is an HE env host, so the message in the web UI should at least state what is actually wrong instead of giving confusing info.

I'm not sure what's wrong with the host yet, but this is the --vm-status output:

~~~
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : slot-2.example.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : b11ed098
Host timestamp                     : 83112


--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : slot-5b.example.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : a9080a8b
Host timestamp                     : 70588
~~~
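
Putting the log and the --vm-status output together: the HA filter drops any host whose HA score is not positive, and reports the drop with the generic "not a Hosted Engine host" detail. A minimal sketch of that behavior as it appears from the log; the class, field, and message names are illustrative, not the actual ovirt-engine code:

~~~
// Simplified sketch of the HA filter behaviour inferred from the log above.
// Names are illustrative, not the real engine API.
import java.util.List;
import java.util.stream.Collectors;

class Host {
    final String name;
    final int haScore; // score reported by the HA agent; 0 = not operational

    Host(String name, int haScore) {
        this.name = name;
        this.haScore = haScore;
    }
}

class HostedEngineHAFilterSketch {
    // Keep only hosts with a positive HA score. Note the detail message cannot
    // distinguish "hosted-engine never deployed" from "deployed but scoring 0".
    static List<Host> filter(List<Host> candidates) {
        return candidates.stream()
                .filter(h -> {
                    if (h.haScore <= 0) {
                        System.out.printf(
                            "Host '%s' was filtered out as it doesn't have a positive score (the score is %d)%n",
                            h.name, h.haScore);
                        return false; // surfaces in the UI as VAR__DETAIL__NOT_HE_HOST
                    }
                    return true;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // slot-2 has score 3400, _slot-5b has score 0 (see --vm-status above).
        List<Host> kept = filter(List.of(new Host("slot-2", 3400), new Host("_slot-5b", 0)));
        kept.forEach(h -> System.out.println("Candidate kept: " + h.name));
    }
}
~~~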

Version-Release number of selected component (if applicable):
rhevm-backend-3.6.3.4-0.1.el6.noarch

How reproducible:
???

Steps to Reproduce:
1.
2.
3.

Actual results:
The engine reports that a host that is part of the HE env is not an HE env host, even though "blacklisting" the host is probably caused by an operational issue (score = 0).

Expected results:
The engine should report that the destination HE env host is not operating correctly, or something along those lines. It should definitely _not_ say that the destination host is not an HE env host when it in fact is one (even though it currently has some issue).

Additional info:

Comment 2 Jiri Belka 2016-07-22 14:20:52 UTC
It seems there's something wrong in the logic:

- I was migrating from dell-r210ii-03 to dell-r210ii-04, but the free memory computation referred to the _source_ host (i.e. dell-r210ii-03):

~~~
...
2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.HostedEngineHAClusterFilterPolicyUnit] (default task-63) [3610c5f5] Host 'dell-r210ii-04' was filtered out as it doesn't have a positive score (the score is 0)
2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.HostedEngineHAClusterFilterPolicyUnit] (default task-63) [3610c5f5] Host 'dell-r210ii-04' was filtered out as it doesn't have a positive score (the score is 0)
2016-07-22 14:13:08,648 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-63) [3610c5f5] Candidate host 'dell-r210ii-04' ('c088b29d-30ac-454b-890f-d09301758433') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null)
2016-07-22 14:13:08,648 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-63) [3610c5f5] Candidate host 'dell-r210ii-04' ('c088b29d-30ac-454b-890f-d09301758433') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null)

            ^^^ Candidate host 'dell-r210ii-04' !

2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] hasPhysMemoryToRunVM: host 'dell-r210ii-03'; free memory is : 3848.0 MB (+ 0 MB pending); free swap is: 8191 MB, required memory is 4161.0 MB; Guest overhead 65 MB

            ^^^ hasPhysMemoryToRunVM: host 'dell-r210ii-03' !

2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] hasPhysMemoryToRunVM: host 'dell-r210ii-03'; free memory is : 3848.0 MB (+ 0 MB pending); free swap is: 8191 MB, required memory is 4161.0 MB; Guest overhead 65 MB
2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] 4161.0 <= ???  12039.0
2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] 4161.0 <= ???  12039.0
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] hasOvercommitMemoryToRunVM: host 'dell-r210ii-03'; max scheduling memory : 3151.0 MB; required memory is 4161.0 MB; Guest overhead 65 MB
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] hasOvercommitMemoryToRunVM: host 'dell-r210ii-03'; max scheduling memory : 3151.0 MB; required memory is 4161.0 MB; Guest overhead 65 MB
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] 4161.0 <= ???  3151.0
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] 4161.0 <= ???  3151.0
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.MemoryPolicyUnit] (default task-63) [3610c5f5] Host 'dell-r210ii-03' is already too close to the memory overcommitment limit. It can only accept 3151.0 MB of additional memory load.
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.MemoryPolicyUnit] (default task-63) [3610c5f5] Host 'dell-r210ii-03' is already too close to the memory overcommitment limit. It can only accept 3151.0 MB of additional memory load.
2016-07-22 14:13:08,649 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-63) [3610c5f5] Candidate host 'dell-r210ii-03' ('320d4f9c-4360-40bb-af7e-d93dcc87401b') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'Memory' (correlation id: null)
2016-07-22 14:13:08,649 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-63) [3610c5f5] Candidate host 'dell-r210ii-03' ('320d4f9c-4360-40bb-af7e-d93dcc87401b') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'Memory' (correlation id: null)
2016-07-22 14:13:08,649 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand] (default task-63) [3610c5f5] Validation of action 'MigrateVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName dell-r210ii-03,$filterName Memory,$availableMem 3151.000000,VAR__DETAIL__NOT_ENOUGH_MEMORY,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL,VAR__FILTERTYPE__INTERNAL,$hostName dell-r210ii-04,$filterName HA,VAR__DETAIL__NOT_HE_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL

                         ^^ uh?

2016-07-22 14:13:08,649 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand] (default task-63) [3610c5f5] Validation of action 'MigrateVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TY:
...
~~~

Comment 4 Jiri Belka 2016-07-22 14:24:52 UTC
My env was SHE, migrated from oVirt 3.6 to a 4.0 snapshot, but the DC/CL compatibility level was still 3.6 on the 4.x engine.

ovirt-engine-4.0.3-0.0.master.20160720203246.git9c88731.el7.centos.noarch

my hosts:

oVirt Node Hypervisor release 3.6 (0.999.201607211021.el7.centos)
ovirt-release-host-node-4.0.2-1.el7.noarch

Comment 5 Yanir Quinn 2016-09-28 12:16:23 UTC
1. Does the cluster have any scheduling policy (different from none)?
2. How was the migration done? Via the UI? REST? Load balancing?

Comment 6 Jiri Belka 2016-10-05 09:54:02 UTC
(In reply to Yanir from comment #5)
> 1. Does the cluster have any scheduling policy (different from none)?
> 2. How was the migration done? Via the UI? REST? Load balancing?

My memory is a little bit dim two months after submitting this issue.

1. Nope
2. Via 'Migrate' button in UI

There are logs as attachments.

Comment 7 Yanir Quinn 2016-10-31 10:13:02 UTC
Regarding https://bugzilla.redhat.com/show_bug.cgi?id=1314377#c2 :

1. The source host appears in the log messages because all hosts are taken into the computation of the target host (unless you choose a specific host you want to migrate to); see the sketch after this list.
2. The memory debug messages give you statistics on the memory state of each host; memory state is not the reason this host was filtered out (note again that these are debug messages).
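
A minimal sketch of that candidate evaluation, assuming a simple filter chain that every cluster host passes through; the names are illustrative, not the engine's real scheduling API:

~~~
// Sketch of why the *source* host shows up in the debug lines: every host in
// the cluster is run through every filter unless a destination was pinned.
// Scores and memory values loosely follow the log in comment 2; the -04
// memory figure is made up since that host was filtered before the memory check.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class SchedulingSketch {
    record Candidate(String name, int haScore, double maxSchedulingMemMb) {}

    static List<Candidate> schedule(List<Candidate> allHosts, double requiredMemMb) {
        List<Predicate<Candidate>> filters = List.of(
                c -> c.haScore() > 0,                        // the "HA" filter
                c -> c.maxSchedulingMemMb() >= requiredMemMb // the "Memory" filter
        );
        List<Candidate> survivors = new ArrayList<>(allHosts);
        for (Predicate<Candidate> f : filters) {
            survivors.removeIf(c -> !f.test(c));
        }
        return survivors;
    }

    public static void main(String[] args) {
        List<Candidate> result = schedule(List.of(
                new Candidate("dell-r210ii-03", 3400, 3151.0), // source: fails Memory
                new Candidate("dell-r210ii-04", 0, 12039.0)),  // target: fails HA
                4161.0);
        System.out.println(result.isEmpty()
                ? "SCHEDULING_ALL_HOSTS_FILTERED_OUT"
                : "Chosen: " + result.get(0).name());
    }
}
~~~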

Now for the main reason for this bug:
"  Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:
    The host _slot-5b did not satisfy internal filter HA because it is not a Hosted Engine host.."

The reason for the score being zero (thus filtering the host out as not a hosted engine host) might be that the engine hasn't noticed the deployment yet, or that something failed on the host.

Since the score comes from VDSM, we can add to the UI or debug log message something like: "... if it is a designated hosted engine host, check the host status" (this can be done by logging into any hosted engine host and issuing the hosted-engine --vm-status command).

The other option is to leave the log message as it is.

Comment 8 Yanir Quinn 2016-10-31 10:37:41 UTC
This can be considered a documentation bug.

Comment 9 Martin Sivák 2016-11-03 15:24:05 UTC
The hosted engine score was 0, because the host was not ready:

MainThread::INFO::2016-03-03 14:10:46,292::hosted_engine::875::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
MainThread::ERROR::2016-03-03 14:10:46,292::hosted_engine::845::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=23c03bb6-9889-4cbf-b7ad-55b9a2c70653, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 452, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 846, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=23c03bb6-9889-4cbf-b7ad-55b9a2c70653, host_id=2): timeout during domain acquisition

VDSM took a long time to synchronize its internals.

Comment 10 Martin Sivák 2016-11-03 15:26:02 UTC
This bug can now be considered just an error message update bug. In the scheduling result, we should distinguish between a non-HE host and a not operational, but HE-enabled, host; a sketch of that split follows.
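
A minimal sketch of that distinction, assuming the filter can see both whether hosted engine is deployed on the host and the current HA score; the enum and names are illustrative only:

~~~
// Sketch of the message split suggested above: separate "really not a hosted
// engine host" from "HE host that is currently not operational (score 0)".
// Names are illustrative, not actual engine message keys.
class HaFilterMessageSketch {
    enum Detail { NOT_HE_HOST, HE_HOST_NOT_OPERATIONAL }

    // Returns null when the host passes the HA filter.
    static Detail classify(boolean heDeployed, int haScore) {
        if (!heDeployed) {
            return Detail.NOT_HE_HOST;
        }
        return haScore > 0 ? null : Detail.HE_HOST_NOT_OPERATIONAL;
    }

    public static void main(String[] args) {
        // _slot-5b from the original report: deployed, but score 0.
        System.out.println(classify(true, 0));    // HE_HOST_NOT_OPERATIONAL
        System.out.println(classify(false, 0));   // NOT_HE_HOST
        System.out.println(classify(true, 3400)); // null -> host passes the filter
    }
}
~~~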

Comment 11 Nikolai Sednev 2017-11-30 15:20:57 UTC
Deployed over Gluster using:
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171128125909.20171128125907.gitfa5daa6.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171129192644.git440040c.el7.centos.noarch
ovirt-engine-appliance-4.2-20171129.1.el7.centos.noarch

I caused the destination host to get a zero score prior to migration of the SHE VM to it.
I then tried to migrate the SHE VM to the destination ha-host with the zero score and received this message from the UI:
"Operation Canceled
Error while executing action:

HostedEngine:

    Cannot migrate VM because the VM is in Not Responding status."

Moving to verified.

Comment 12 Sandro Bonazzola 2017-12-20 11:01:37 UTC
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.