Bug 1314377 - Confusing message when the HE VM cannot be migrated to "failing" HE host
Confusing message when the HE VM cannot be migrated to "failing" HE host
Status: MODIFIED
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core (Show other bugs)
3.6.3.3
Unspecified Unspecified
low Severity medium (vote)
: ovirt-4.2.0
: ---
Assigned To: Andrej Krejcir
Nikolai Sednev
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-03-03 08:22 EST by Jiri Belka
Modified: 2017-11-21 04:38 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
dfediuck: ovirt‑4.2?
rule-engine: planning_ack+
rgolan: devel_ack+
rule-engine: testing_ack?


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 84177 master MERGED core: Fix error message in HostedEngineHAClusterFilterPolicyUnit 2017-11-20 10:15 EST

  None (edit)
Description Jiri Belka 2016-03-03 08:22:36 EST
Description of problem:

I cannot migrate our HE VM to other host part of HE setup, msg in web UI is confusing:

~~~	
Operation Canceled
Error while executing action:

HostedEngine:

    Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:
    The host _slot-5b did not satisfy internal filter HA because it is not a Hosted Engine host..
~~~

It seems the problem is related to "score" of the host:

~~~
...
2016-03-03 14:10:40,374 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (ajp-/127.0.0.1:8702-15) [648b618b] hasPhysMemoryToRunVM: host '_slot-5b'; free memory is : 25104.0 MB (+ 0 MB pending); free swap
 is: 2047 MB, required memory is 16449.0 MB; Guest overhead 65 MB
2016-03-03 14:10:40,374 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (ajp-/127.0.0.1:8702-15) [648b618b] 16449.0 <= ???  27151.0
2016-03-03 14:10:40,374 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (ajp-/127.0.0.1:8702-15) [648b618b] hasOvercommitMemoryToRunVM: host '_slot-5b'; max scheduling memory : 40500.0 MB; required memo
ry is 16449.0 MB; Guest overhead 65 MB
2016-03-03 14:10:40,375 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (ajp-/127.0.0.1:8702-15) [648b618b] 16449.0 <= ???  40500.0
2016-03-03 14:10:40,377 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.HostedEngineHAClusterFilterPolicyUnit] (ajp-/127.0.0.1:8702-15) [648b618b] Host '_slot-5b' was filtered out as it doesn't have a pos
itive score (the score is 0)
2016-03-03 14:10:40,377 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (ajp-/127.0.0.1:8702-15) [648b618b] Candidate host '_slot-5b' ('229dfe21-9c3b-4ec3-8ffe-192ed87a408e') was filtered out by 'V
AR__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null)
2016-03-03 14:10:40,377 WARN  [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (ajp-/127.0.0.1:8702-15) [648b618b] CanDoAction of action 'MigrateVmToServer' failed for user admin@internal. Reasons: VAR__ACTIO
N__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName _slot-5b,$filterName HA,VAR__DETAIL__NOT_HE_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
...
~~~

The destination host is HE env host so at least the msg in web ui should state what's wrong and not to say confusing info.

I'm not sure what's wrong with the host yet but this is --vm-status output:

~~~
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : slot-2.example.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : b11ed098
Host timestamp                     : 83112


--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : slot-5b.example.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : a9080a8b
Host timestamp                     : 70588
~~~

Version-Release number of selected component (if applicable):
rhevm-backend-3.6.3.4-0.1.el6.noarch

How reproducible:
???

Steps to Reproduce:
1.
2.
3.

Actual results:
engine informs that a host part of HE env is not HE env host, even so "blacklisting" the host is probably caused by operational issue (score = 0)

Expected results:
engine should inform that dest HE env host is not operating correctly or something like that. for sure it should _not_ say that dest host is not HE env host if it in fact is (although it is having some issue for now)

Additional info:
Comment 2 Jiri Belka 2016-07-22 10:20:52 EDT
It seems there's something wrong in logic:

- i was migration from dell-r210ii-03 to dell-r210ii-04 but free memory computation was related to _source_ host (ie. dell-r210ii-03):

~~~
...
2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.HostedEngineHAClusterFilterPolicyUnit] (default task-63) [3610c5f5] Host 'dell-r210ii-04' was filtered out as it doesn't have a posi
tive score (the score is 0)
2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.HostedEngineHAClusterFilterPolicyUnit] (default task-63) [3610c5f5] Host 'dell-r210ii-04' was filtered out as it doesn't have a posi
tive score (the score is 0)
2016-07-22 14:13:08,648 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-63) [3610c5f5] Candidate host 'dell-r210ii-04' ('c088b29d-30ac-454b-890f-d09301758433') was filtered out by 'VA
R__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null)
2016-07-22 14:13:08,648 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-63) [3610c5f5] Candidate host 'dell-r210ii-04' ('c088b29d-30ac-454b-890f-d09301758433') was filtered out by 'VA
R__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null)

            ^^^ Candidate host 'dell-r210ii-04' !

2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] hasPhysMemoryToRunVM: host 'dell-r210ii-03'; free memory is : 3848.0 MB (+ 0 MB pending); free swap i
s: 8191 MB, required memory is 4161.0 MB; Guest overhead 65 MB

            ^^^ hasPhysMemorytoRunVM: host 'dell-r210ii-03' !

2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] hasPhysMemoryToRunVM: host 'dell-r210ii-03'; free memory is : 3848.0 MB (+ 0 MB pending); free swap i
s: 8191 MB, required memory is 4161.0 MB; Guest overhead 65 MB
2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] 4161.0 <= ???  12039.0
2016-07-22 14:13:08,648 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] 4161.0 <= ???  12039.0
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] hasOvercommitMemoryToRunVM: host 'dell-r210ii-03'; max scheduling memory : 3151.0 MB; required memory is 4161.0 MB; Guest overhead 65 MB
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] hasOvercommitMemoryToRunVM: host 'dell-r210ii-03'; max scheduling memory : 3151.0 MB; required memory is 4161.0 MB; Guest overhead 65 MB
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] 4161.0 <= ???  3151.0
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.SlaValidator] (default task-63) [3610c5f5] 4161.0 <= ???  3151.0
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.MemoryPolicyUnit] (default task-63) [3610c5f5] Host 'dell-r210ii-03' is already too close to the memory overcommitment limit. It can only accept 3151.0 MB of additional memory load.
2016-07-22 14:13:08,649 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.MemoryPolicyUnit] (default task-63) [3610c5f5] Host 'dell-r210ii-03' is already too close to the memory overcommitment limit. It can only accept 3151.0 MB of additional memory load.
2016-07-22 14:13:08,649 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-63) [3610c5f5] Candidate host 'dell-r210ii-03' ('320d4f9c-4360-40bb-af7e-d93dcc87401b') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'Memory' (correlation id: null)
2016-07-22 14:13:08,649 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-63) [3610c5f5] Candidate host 'dell-r210ii-03' ('320d4f9c-4360-40bb-af7e-d93dcc87401b') was filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'Memory' (correlation id: null)
2016-07-22 14:13:08,649 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand] (default task-63) [3610c5f5] Validation of action 'MigrateVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName dell-r210ii-03,$filterName Memory,$availableMem 3151.000000,VAR__DETAIL__NOT_ENOUGH_MEMORY,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL,VAR__FILTERTYPE__INTERNAL,$hostName dell-r210ii-04,$filterName HA,VAR__DETAIL__NOT_HE_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL

                         ^^ uh?

2016-07-22 14:13:08,649 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand] (default task-63) [3610c5f5] Validation of action 'MigrateVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TY:
...
~~~
Comment 4 Jiri Belka 2016-07-22 10:24:52 EDT
My env was SHE, migrated from oVirt 3.6 to 4.0-snapshot but DC/CL compat was still 3.6 on 4.x engine.

ovirt-engine-4.0.3-0.0.master.20160720203246.git9c88731.el7.centos.noarch

my hosts:

oVirt Node Hypervisor release 3.6 (0.999.201607211021.el7.centos)
ovirt-release-host-node-4.0.2-1.el7.noarch
Comment 5 Yanir Quinn 2016-09-28 08:16:23 EDT
1.does the cluster have any scheduling policy (different from none)?
2.how was the migration done ? via the UI ? REST ? load balancing ?
Comment 6 Jiri Belka 2016-10-05 05:54:02 EDT
(In reply to Yanir from comment #5)
> 1.does the cluster have any scheduling policy (different from none)?
> 2.how was the migration done ? via the UI ? REST ? load balancing ?

My memory is little bit dim after two months since submitting this issue.

1. Nope
2. Via 'Migrate' button in UI

There are logs as attachments.
Comment 7 Yanir Quinn 2016-10-31 06:13:02 EDT
Regarding https://bugzilla.redhat.com/show_bug.cgi?id=1314377#c2 :

1.the source host appears in log messages because all hosts are taken into computation of the target host (unless you choose a specific host you want to migrate to)
2.the memory Debug messages gives you statistics regarding the memory state of each host, memory state is not the issue of filtering out a host (note again that these are debug messages) 

now for the main reason of this bug : 
"  Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:
    The host _slot-5b did not satisfy internal filter HA because it is not a Hosted Engine host.."

The reason for the score being zero (thus filtering it out as a not hosted engine host) might be that the engine hasn't noticed the deployment yet or something failed on the host. 

Since the score comes from VDSM we can add to the UI or Debug log message something like :  "... if it is a designated hosted engine host - check host status"    (this can be done by logging into any hosted engine host and issuing hosted-engine --vm-status command.) 

Other option is to leave the log message as it is.
Comment 8 Yanir Quinn 2016-10-31 06:37:41 EDT
This can be considered as a documentation bug.
Comment 9 Martin Sivák 2016-11-03 11:24:05 EDT
The hosted engine score was 0, because the host was not ready:

MainThread::INFO::2016-03-03 14:10:46,292::hosted_engine::875::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
MainThread::ERROR::2016-03-03 14:10:46,292::hosted_engine::845::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=23c03bb6-9889-4cbf-b7ad-55b9a2c70653, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 452, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 846, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=23c03bb6-9889-4cbf-b7ad-55b9a2c70653, host_id=2): timeout during domain acquisition

VDSM took a long time to synchronize its internals.
Comment 10 Martin Sivák 2016-11-03 11:26:02 EDT
This bug can be now considered just an error message update bug. We should distinguish between non-HE host and not operational, but HE enabled host in the scheduling result.

Note You need to log in before you can comment on or make changes to this bug.